RELATION BETWEEN STATURE AND BLOOD GROUP
AMONG INDIAN SOLDIERS 


By N. T. MATHEW 
Army Headquarters, India 


SUMMARY. This paper analyses blood group data relating to 4543 soldiers of the Indian Army
surveyed in 1952-53. It is seen that there are significant differences in stature among the different blood
groups. Group B is the tallest and group A the shortest. These differences persist even when the data
are considered separately for different states and communities. Analysis of the gene frequencies in
different states and communities reveals certain interesting group affinities.


1. INTRODUCTION 


An important source of interest in the study of blood group frequencies is 
the light that they throw on anthropological differences. Anthropometrists had been 
using in their work measurements of body dimensions for a long time even before 
they started analysing blood groups. But no attempt seems to have been made to 
correlate blood groups and body measurements. Some indications that blood groups 
are related to physical traits such as proneness to contract certain diseases are referred 
to by Mourant (1954). The data used in the present paper reveal significant variation
in stature among persons of different blood groups. This does not appear to have
been noticed before.


2. THE SAMPLE
The present data relate to 4543 soldiers of the Indian Army who formed part 
of a somewhat larger sample of soldiers selected for a survey of body measurements 
carried out in 1952-53. The primary object of the survey was the collection of data 
for standardization of clothing sizes. A medical officer, Capt. D. N. Bhattacharya, 
who was in charge of the field work found time for blood group determinations while 
his measuring team was busy on the body measurements. 


Soldiers of the Indian Army cannot be regarded as a random sample of the
general population of the country. The volunteers who come forward for recruitment
belong in varying proportions to different economic, social and regional strata. The
actual recruits are further selected to conform to certain standards of height, weight
and other physical characteristics.

The 4543 soldiers considered here do not constitute a random sample of
soldiers. They were chosen from units located at the time of survey in Delhi and some
other stations, so as to obtain 200 soldiers from each of a number of "army classes"
which had to be studied separately for clothing sizes.


However, these considerations may not affect conclusions about blood group
frequencies as blood group did not influence the selection in any way.

The ‘states’ referred to in this paper are the pre-reorganization states which 
existed in India at the time of the survey plus Nepal which is outside India. The 
communities are either tribes (e.g. Adibasis, Ahirs) or linguistic-territorial groups 
(e.g. Bengalees, Biharis, Tamilians) or religious groups (Muslims, Sikhs, Christians).
None from these three religious groups are included in any of the tribal or linguistic
or territorial groups. Statements made by the subjects at the time of the survey
form the basis of grouping. ‘Sikhs (M & R)’ stands for Mazhabi and Ramdasia
Sikhs, who are supposed to have belonged originally to low caste Hindus. ‘Syrian
Christians’ are an indigenous group whose connection with Syria is not racial.


3. DIFFERENCES IN STATURE 
The main object of this paper is to invite attention to differences among the
average values of height in persons belonging to the four ABO blood groups. In
Table 1 we give the analysis of variance of height between and within blood groups.


TABLE 1. ANALYSIS OF VARIANCE OF HEIGHT (cm²)

source of variation       d.f.      s.s.     m.s.       F
between blood groups         3       467    155.6    4.37
within blood groups       4539    161607     35.6
total                     4542    162074

The ratio of variances, which comes out as 4.37, exceeds the one per cent level of
significance. The mean height for each blood group is given in Table 2.
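
The arithmetic behind Table 1 can be retraced from the printed sums of squares. The sketch below is only an illustration: the function names are hypothetical, and the tail probability relies on a large-sample assumption (with 4539 degrees of freedom in the denominator, 3F is treated as chi-square with 3 degrees of freedom) rather than on whatever tables the author consulted.

```python
import math

# Sums of squares and degrees of freedom as printed in Table 1.
SS_BETWEEN, DF_BETWEEN = 467.0, 3
SS_WITHIN, DF_WITHIN = 161607.0, 4539


def variance_ratio(ss_b, df_b, ss_w, df_w):
    """Return (between m.s., within m.s., F) for a one-way layout."""
    ms_b = ss_b / df_b
    ms_w = ss_w / df_w
    return ms_b, ms_w, ms_b / ms_w


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 d.f. (closed form)."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


ms_between, ms_within, f_ratio = variance_ratio(SS_BETWEEN, DF_BETWEEN, SS_WITHIN, DF_WITHIN)
# Assumption of this sketch: DF_BETWEEN * F is approximately chi-square(3)
# because the denominator has very many degrees of freedom.
p_approx = chi2_sf_3df(DF_BETWEEN * f_ratio)

print(f"m.s. between = {ms_between:.2f}, m.s. within = {ms_within:.2f}, F = {f_ratio:.2f}")
print(f"approximate upper-tail probability = {p_approx:.4f}")
```

Under that approximation the tail probability falls below 0.01, in line with the statement that the ratio exceeds the one per cent level.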


TABLE 2. MEANS AND STANDARD ERRORS OF HEIGHT IN CM

blood groups                   O       A       B      AB   total
number of observations      1480    1242    1406     415    4543
mean                       167.6   167.2   168.0   167.3   167.5
standard error of mean      0.16    0.17    0.16    0.29    0.09

It would appear that the ‘B’ group is taller than the other phenotypes. Second in
order of height comes ‘O’, third is ‘AB’ and the shortest is ‘A’.
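
The paper does not say how the standard errors in Table 2 were obtained. A minimal sketch, assuming they come from the pooled within-groups mean square of Table 1 divided by the group size, reproduces the printed figures; the function name is hypothetical.

```python
import math

MS_WITHIN = 35.6  # within-groups mean square from Table 1 (cm^2)

# Group sizes and printed standard errors from Table 2.
counts = {"O": 1480, "A": 1242, "B": 1406, "AB": 415, "total": 4543}
printed_se = {"O": 0.16, "A": 0.17, "B": 0.16, "AB": 0.29, "total": 0.09}


def pooled_standard_error(ms_within, n):
    """Standard error of a mean of n observations using a pooled variance."""
    return math.sqrt(ms_within / n)


computed_se = {g: pooled_standard_error(MS_WITHIN, n) for g, n in counts.items()}
for group, se in computed_se.items():
    print(f"{group:>5}: computed {se:.3f}, printed {printed_se[group]:.2f}")
```

On these standard errors the difference of 0.8 cm between B and A is several times its own standard error, while the neighbouring groups differ by much less.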


The significance of the variance ratio of Table 1 may possibly be due to the
total sample being a mixture of individuals from different parts of India with different
proportions of the O, A, B, AB phenotypic frequencies and different average heights.
The individuals are classified by state and communities for a closer study in the
following section.
Similar analysis for weight as well as blood pressure did not reveal any significant
differences. The averages of age for the four groups were also found to be
nearly equal, being 26.7, 26.6, 26.6 and 26.7 respectively for O, A, B and AB. Even
for height the magnitude of the difference is so small that significance could not have
been achieved in smaller series of observations. This probably explains why such
differences were not noticed before.

Analysis of variance was carried out separately for the twenty states and
thirty-two communities considered in this paper. The total numbers in these states
and communities vary from 4 to 970. Only in one case did the variance ratio prove
significant.


4. STATES AND COMMUNITIES 


The effect, if any, of blood group on height must be regarded as superimposed 
on the effect of environmental and racial differences. These two latter factors may 
to some extent be reflected in the differences between states (Table 3) and
communities (Table 4).


TABLE 3. AVERAGE HEIGHT BY BLOOD GROUPS AND STATES

frequency average height in cm
state O A B AB total O A B AB total
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11)

Assam 78 84 42 5 209 160.8 161.1 161.7 160.9 161.1
Bihar 61 84 82 42 269 167.1 166.8 166.3 166.2 166.6
Bombay 116 104 117 36 373 165.7 165.8 166.0 166.1 165.8
Coorg 5 1 4 2 12 167.0 166.5 167.9 170.5 167.8
Delhi 8 6 8 - 22 170.5 - 173.3
Himachal Pradesh 7 9 8 2 26 168.2 171.5 169.3
Hyderabad 5 2 3 1 11 166.1 160.5 165.6
Jammu and Kashmir 67 83 97 28 275 168.5 169.0 168.7
Madhya Bharat 3 - 5 1 9 167 166.0 168.2
Madhya Pradesh 20 17 19 7 63 167.6 165.0 166.5
Madras 229 109 171 34 543 167.5 167.5 167.2
Mysore 6 3 2 3 14 163.8 171.0 166.1
Orissa 5 1 3 1 10 166.8 158.5 165.8 160.5 164.9
PEPSU 62 41 51 21 175 171.7 169.6 171.4 169.5 170.8
Punjab 291 262 329 88 970 170.1 170.3 170.5 169.3 170.2
Rajasthan 110 54 80 14 258 170.5 170.5 171.1 169.5 170.6
Travancore-Cochin 86 56 50 7 199 167.3 165.5 166.9 165.4 166.6
Uttar Pradesh 200 217 240 100 757 167.2 166.3 167.1 166.1 166.8
West Bengal 50 48 49 11 158 167.3 167.1 167.4 166.0 167.2
Nepal 71 61 46 12 190 161.0 162.3 162.2 163.0 161.9
total 1480 1242 1406 415 4543 167.6 167.2 168.0 167.3 167.5
number of group averages greater than the state averages: observed 9 7 16 7
expected 10 9.5 10 9.5
It will be seen from Table 3 that among states the average height varies by
12.2 cm, from 161.1 cm in Assam to 173.3 cm in Delhi. Yet in 16 out of 20 states
the average height of B is greater than the general average for the state. Due to
chance only 10 out of 20 states can be expected to have B taller than the average.
The difference between the numbers observed and expected can be seen to be
statistically significant.
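
The comparison of 16 observed against 10 expected can be read as a sign test. The paper does not name the test it used, so the following is a sketch under that assumption: each state is treated as an independent trial in which B is equally likely to fall above or below the state average. The function name is hypothetical.

```python
from math import comb


def upper_tail_sign_test(successes, trials):
    """P(X >= successes) for X ~ Binomial(trials, 1/2), computed exactly."""
    favourable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favourable / 2 ** trials


# 16 of the 20 states in Table 3 have the B average above the state average.
p_states = upper_tail_sign_test(16, 20)
print(f"one-sided probability = {p_states:.4f}")
```

The exact one-sided probability is 6196 / 2^20, a little under 0.006, which agrees with the statement that the excess is significant.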


TABLE 4. AVERAGE HEIGHT BY BLOOD GROUPS AND COMMUNITIES

frequency average height in cm
community O A B AB total O A B AB total

Adibasis (Bihar) 27 37 41 21 126 164.7 164.1
Adibasis (Other) 8 12 10 2 32 163.8 164.4
Ahirs 77 62 74 24 237 171.4 171.5
Andhras 36 21 29 7 93 164.8 166.3
Assamese 76 81 38 5 200 160.5 161.0 160.8 160.9 160.8
Balmikis 4 5 17 4 30 163.4 168.8 166.7 159.0 165.6
Bengalees 53 53 57 12 175 168.0 166.9 167.5 165.? 167.3
Biharis 22 29 27 18 96 171.1 170.0 169.1 170.4 170.3
Christians (Syrian) 37 22 23 2 84
Christians (Tamil) 19 13 29 6 67 167.3 168.7 166.7 164.1 167.0
Christians (Other) 5 1 5 - 11 163.0 170.5 162.9 - 163.7
Coorgs 5 1 4 1 11 167.0 166.5 167.9 172.0 167.7
Dogras 59 97 74 38 268 168.1 168.2 168.8 168.4 168.4
Garhwalis 40 75 57 24 196 162.5 163.0 163.9 163.0 163.2
Gurkhas 73 64 49 14 200 160.9 162.4 162.4 162.9 161.9
Gujjars 62 48 79 11 200 170.8 170.0 171.0 171.0 170.7
Hindus (U.P.) 17 19 23 19 78 167.8 170.2 165.1 167.4 167.5
Jats 51 31 46 9 137 171.5 171.7 171.7 173.6 171.8
Jammu Hindus 50 55 77 18 200 168.4 168.2 168.1 169.5 168.3
Kanarese 3 2 1 2 8 171.7 166.0 171.5 170.8 170.0
Kumaonis 50 64 57 29 200 166.7 166.6 165.7 165.2 166.2
Lingayats 8 4 4 2 18 165.4 167.6 167.8 169.5 166.9
Mahars 64 51 64 21 200 164.4 163.9 164.6 165.8 164.5
Marathas 63 64 63 18 208 167.0 167.5 167.1 165.9 167.1
Malayalees 105 69 53 13 240 167.2 166.3 167.4 167.0 167.0
Muslims (U.P.) 15 9 20 4 48 166.0 166.2 169.7 165.0 167.5
Oriyas 1 - 2 1 4 163.0 - 164.3 160.5 163.0
Punjabis 38 33 39 14 124 168.9 170.7 170.5 168.5 169.8
Rajputs 126 67 105 22 320 170.0 170.9 170.0 168.8 170.1
Sikhs (M & R) 66 52 67 17 202 167.6 167.6 168.5 167.0 167.9
Sikhs (Other) 100 59 88 21 268 172.6 173.1 173.0 ? 172.7
Tamilians 120 42 84 16 262 167.6 166.7 167.3 168.1 167.4
total 1480 1242 1406 415 4543 167.6 167.2 168.0 167.3 167.5
number of group averages greater than the community average: observed 13.5 12 22 14
expected 16 15.5 16 15.5


Similarly it can be seen from Table 4 that the average height varies from
160.8 among Assamese to 172.7 among Sikhs (Other). But here also against an
expected number of 16 communities we have actually 22 communities in which the
‘B’ group is taller than the average. The difference can be seen to be significant at
the 5 per cent level.
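
For the 32 communities the same comparison can be sketched with the normal approximation to the binomial. This is an illustrative assumption rather than the author's stated procedure; the continuity correction and the function names are choices made here.

```python
import math


def normal_upper_tail(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def sign_test_normal_approx(observed, trials):
    """One-sided normal approximation to the sign test, with continuity correction."""
    expected = trials / 2.0
    sd = math.sqrt(trials * 0.25)
    z = (observed - 0.5 - expected) / sd
    return z, normal_upper_tail(z)


# 22 of the 32 communities in Table 4 have the B average above the community average.
z_value, p_one_sided = sign_test_normal_approx(22, 32)
print(f"z = {z_value:.2f}, one-sided probability = {p_one_sided:.3f}")
```

The one-sided probability lies between 0.02 and 0.03. Whether the paper's 5 per cent level refers to a one-sided or a two-sided test is not stated.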

From the evidence considered above, it seems likely that there are significant
though small differences in stature associated with blood groups. Group ‘B’ has
the highest average stature and probably ‘A’ the lowest. It is not easy to explain
why this should be so. This is unlikely to be the result of intermixture with a tall
race which came into India bringing with it also a high percentage of the B gene. It
is found that some of the primitive tribes of India have a high proportion of B. Though
it is true that the races of Central Asia have a comparatively higher frequency of B,
we have no evidence that they were also tall. The ‘Aryans’ are believed to have
come into India from the direction of Persia. But the Aryans probably had a low
frequency of B as is the case with present day populations in some Western European
countries.

The only tenable theory would be to regard a contribution to stature as the
effect of the B gene itself or some closely linked gene. Height is known to be the result
of a large number of genes. Perhaps one or two of these are linked to the B gene.


5. GENE FREQUENCIES 
In Tables 5 and 6 gene frequencies estimated by Bernstein's method are given
respectively for the states and for the communities. Only such states and communities
are shown as are represented by more than 25 individuals in our sample. Charts
1 and 2 give graphical representations of the gene frequencies in Tables 5 and 6 by
means of trilinear coordinates.
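
The paper names Bernstein's method without writing it out. A minimal sketch of the usual textbook form is given below, on the assumption that this is the form used: square-root estimates of p, q and r, followed by Bernstein's adjustment for the deviation D of their sum from unity. The function name is hypothetical. Applied to the Dogras counts it reproduces the p and r printed in Table 6 to within rounding.

```python
import math


def bernstein_gene_frequencies(n_o, n_a, n_b, n_ab):
    """Estimate ABO gene frequencies (p, q, r) from phenotype counts.

    Uses Bernstein's square-root estimates followed by his adjustment for the
    deviation D of their sum from unity.
    """
    n = n_o + n_a + n_b + n_ab
    p0 = 1.0 - math.sqrt((n_o + n_b) / n)   # gene A
    q0 = 1.0 - math.sqrt((n_o + n_a) / n)   # gene B
    r0 = math.sqrt(n_o / n)                 # gene O
    d = 1.0 - (p0 + q0 + r0)
    p = p0 * (1.0 + d / 2.0)
    q = q0 * (1.0 + d / 2.0)
    r = (r0 + d / 2.0) * (1.0 + d / 2.0)
    return p, q, r


# Phenotype counts for the Dogras row of Table 6: O, A, B, AB.
p, q, r = bernstein_gene_frequencies(59, 97, 74, 38)
print(f"p = {100 * p:.2f}, q = {100 * q:.2f}, r = {100 * r:.2f}")
```

After the adjustment the three estimates add up to 1 - D²/4, which is why the gene percentages in most rows of Tables 5 and 6 sum to very nearly 100.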

TABLE 5. DISTRIBUTION OF BLOOD GROUPS BY STATES

                          frequency of phenotype            gene percentage
state                      O     A     B    AB  total       p       q       r      χ²
                                                                              (1 d.f.)
Assam                     78    84    42     5    209
Bihar                     61    84    82    42    269   26.91   26.41   46.67    0.71
Bombay                   116   104   117    36    373
Himachal Pradesh           7     9     8     2     26   24.35   21.82   53.81    0.37
Jammu and Kashmir         67    83    97    28    275   22.97   26.37   50.65    1.54
Madhya Pradesh            20    17    19     7     63   21.21   23.24   55.54    0.18
Madras                   229   109   171    34    543   14.16   21.08   64.76    0.12
Nepal                     71    61    46    12    190   21.60   16.71   61.69    0.33
PEPSU                     62    41    51    21    175   19.40   23.00   57.58    3.03
Punjab                   291   262   329    88    970   20.12   24.58   55.30    1.10
Rajasthan                110    54    80    14    258   14.20   20.29   65.50    0.08
Travancore-Cochin         86    56    50     7    199   17.45   15.63   66.91    1.99
Uttar Pradesh            200   217   240   100    757   23.65   25.66   50.66    1.27
West Bengal               50    47    49    12    158   20.97   21.78   57.24    0.67
other states              32    14    25     7     78   14.38   23.00   62.60    1.03
total                   1480  1242  1406   415   4543   20.30   22.60   57.11    0.00

**significant at 1% level
TABLE 6. DISTRIBUTION OF BLOOD GROUPS BY COMMUNITIES

                          frequency of phenotype            gene percentage
community                  O     A     B    AB  total       p       q       r      χ²
                                                                              (1 d.f.)
Adibasis (Bihar)          27    37    41    21    126           28.51   45.16    0.45
Adibasis (Other)           8    12    10     2     32           21.37   53.08    1.12
Ahirs                     77    62    74    24    237                    56.?
Andhras                   36    21    29     7     93   16.37   21.68   61.95
Assamese                  76    81    38     5    200   24.80   11.54   63.64    5.52*
Balmikis                   4     5    17     4     30   16.49   45.66   37.84    0.14
Bengalees                 53    52    57    13    175   20.89   22.73   56.37    1.32
Biharis                   22    29    27    18     96   28.05   26.64   45.28    1.82
Christians (Syrian)       37    22    23     2     84   15.64   16.35   68.01    1.76
Christians (Tamil)        19    13    29     6     67   15.40   30.97   53.64    0.04
Dogras                    59    97    74    38    268   29.53   23.68   46.79    0.02
Garhwalis                 40    75    57    24    196   29.91   23.61   46.47    0.93
Gujjars                   62    48    79    11    200   16.23   26.15   57.60    3.48
Gurkhas                   73    64    49    14    200   21.95   17.27   60.78    0.15
Hindus (U.P.)             18    19    23    19     79   26.94   30.42   42.51    5.80*
Jammu Hindus              49    55    77    18    199   20.66   28.02   51.31    1.99
Jats                      52    31    46     9    138   15.77   22.50   61.73    0.10
Kumaonis                  49    64    57    29    199   26.84   24.49   48.67    0.57
Marathas                  63    64    63    18    208   22.27   21.96   55.76    0.46
Mahars                    64    51    64    21    200   19.93   24.08            0.30
Malayalees               105    69    53    13    240   18.88   14.86   66.26    0.02
Muslims (U.P.)            15     9    20     4     48   14.62   29.32   56.06    0.00
Punjabis                  38    33    39    14    124   21.10   24.22   54.67    0.94
Rajputs                  126    67   105    22    320   15.02   22.33   62.65    0.01
Sikhs (M & R)             66    52    67    17    202   18.90   23.62   57.48    0.11
Sikhs (Other)            100    59    88    21    268   16.22   22.94   60.84    0.10
Tamilians                120    42    84    16    262   11.72   21.28   67.00    0.96
other communities         22     9    16     5     52   14.34           63.12    1.22
total                   1480  1242  1406   415   4543   20.30   22.60   57.11    0.00

*significant at 5% level


The last column in Tables 5 and 6 gives the value of χ² (one degree of freedom)
for testing the agreement between the observed numbers of phenotypes and the
expected numbers calculated from the estimated values of p, q and r. The agreement
is satisfactory except for Assamese and for U.P. Hindus.
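
The two starred values of χ² can be placed against the chi-square distribution with one degree of freedom, whose upper tail has a closed form. The sketch below is illustrative only; the function name is hypothetical and the dictionary simply restates the two values printed in Table 6.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Values flagged in Table 6 as significant at the 5% level.
flagged = {"Assamese": 5.52, "Hindus (U.P.)": 5.80}
for community, chi2 in flagged.items():
    print(f"{community}: chi-square = {chi2:.2f}, P = {chi2_sf_1df(chi2):.3f}")
```

Both probabilities fall between 0.01 and 0.05, so the two groups are significant at the 5 per cent level but not at the 1 per cent level.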
Chart 1. Blood group gene frequencies in different states.

Chart 2. Blood group gene frequencies in different communities.

It is seen from Chart 1 that the populations of the States of Jammu and
Kashmir, Punjab, PEPSU, Himachal Pradesh, Uttar Pradesh, West Bengal, Bombay
and perhaps Bihar form a homogeneous group. Geographically, this group of states
forms a fairly compact region stretching across North and Central India. Assam,
Nepal and Travancore-Cochin are distinct from this group and from each other but
all are on the side of low ‘B’ gene frequency. Rajasthan and Madras are surprisingly
close to each other.


Chart 2 shows that some of the primitive tribes of India are comparatively
rich in ‘B’ genes. Balmikis constitute a notable illustration with the highest percentage
of ‘B’ genes and the lowest of O. The lowest frequency of B is among Assamese, Gurkhas
and Malayalees. The Malayalee Hindus and Malayalee Syrian Christians appear
to be racially close to each other, whereas the distance between the Tamil Hindus
and Tamil Christians is considerable. The U.P. Muslims and U.P. Hindus also seem
to be distinct though the percentage of the ‘B’ genes in both groups is nearly the same.


Detailed figures of blood groups are given in Table 7. We have not calculated
gene frequencies from the detailed figures in Table 7 as the total numbers are small
in most cases.


TABLE 7. DISTRIBUTION OF SOLDIERS ACCORDING TO BLOOD GROUP BY COMMUNITIES
WITHIN EACH STATE

                                              frequencies
state               community                 O    A    B   AB  total
(1)                 (2)                     (3)  (4)  (5)  (6)    (7)
Assam               Assamese                 76   81   38    5    200
                    Bengalees                 2    3    4    -      9
                    total                    78   84   42    5    209
Bihar               Adibasis (Bihar)         27   37   41   21    126
                    Adibasis (Other)          5   11    9    2     27
                    Ahirs                     3    3    2    -      8
                    Bengalees                 1    1    1    1      4
                    Biharis                  22   29   27   18     96
                    Gurkhas                   -    -    1    -      1
                    Rajputs                   2    3    1    -      6
                    Sikhs (Other)             1    -    -    -      1
                    total                    61   84   82   42    269
Bombay              Balmikis                  -    -    1    -      1
                    Kanarese                  -    1    -    -      1
                    Lingayats                 6    4    4    2     16
                    Mahars                   54   41   57   19    171
                    Marathas                 55   56   55   15    181
                    Punjabis                  -    1    -    -      1
                    Rajputs                   -    1    -    -      1
                    Sikhs (Other)             1    -    -    -      1
                    total                   116  104  117   36    373
Coorg               Coorgs                    5    1    4    1     11
                    Kanarese                  -    -    -    1      1
                    total                     5    1    4    2     12
Delhi               Ahirs                     1    2    2    -      5
                    Balmikis                  1    -    -    -      1
                    Gujjars                   -    2    -    -      2
                    Jats                      1    -    4    -      5
                    Punjabis                  4    1    1    -      6
                    Rajputs                   -    1    -    -      1
                    Sikhs (Other)             1    -    1    -      2
                    total                     8    6    8    -     22
Himachal Pradesh    Dogras                    5    9    8    2     24
                    Punjabis                  1    -    -    -      1
                    Rajputs                   1    -    -    -      1
                    total                     7    9    8    2     26
Hyderabad           Andhras                   2    -    -    1      3
                    Christians (Tamil)        -    1    2    -      3
                    Lingayats                 1    -    -    -      1
                    Mahars                    1    1    1    -      3
                    Tamilians                 1    -    -    -      1
                    total                     5    2    3    1     11
Jammu & Kashmir     Dogras                   15   24   14    9     62
                    Jats                      -    1    1    -      2
                    Jammu Hindus             50   55   77   18    200
                    Punjabis                  -    1    -    -      1
                    Rajputs                   -    1    2    -      3
                    Sikhs (Other)             2    1    3    1      7
                    total                    67   83   97   28    275
Madhya Bharat       Ahirs                     -    -    -    1      1
                    Gujjars                   1    -    1    -      2
                    Muslims (U.P.)            -    -    1    -      1
                    Rajputs                   2    -    3    -      5
                    total                     3    -    5    1      9
Madhya Pradesh      Adibasis                  1    -    -    -      1
                    Bengalees                 -    -    3    ?      ?
                    Christians (Tamil)        1    -    -    -      1
                    Gujjars                   1    -    -    -      1
                    Mahars                    8    9    9    2     22
                    Marathas                  8    8    ?    3      ?
                    Punjabis                  1    -    -    -      1
                    total                    20   17   19    7     63
Madras              Andhras                  31   20   28    5     84
                    Christians (Other)        5    1    5    -     11
                    Christians (Syrian)       6    2    4    1     13
                    Christians (Tamil)       18   11   26    6     61
                    Kanarese                  3    -    1    -      4
                    Malayalees               52   36   24    7    119
                    Marathas                  -    -    1    -      1
                    Tamilians               114   39   82   15    250
                    total                   229  109  171   34    543
Mysore              Andhras                   1    1    -    1      3
                    Kanarese                  -    1    -    1      2
                    Lingayats                 1    -    -    -      1
                    Mahars                    1    -    -    -      1
                    Tamilians                 3    1    2    1      7
                    total                     6    3    2    3     14
Orissa              Adibasis                  2    1    1    -      4
                    Andhras                   2    -    -    -      2
                    Oriyas                    1    -    2    1      4
                    total                     5    1    3    1     10
PEPSU               Ahirs                    24   12   20    7     63
                    Dogras                    -    1    -    -      1
                    Gujjars                   2    1    3    -      6
                    Jats                      9    4    3    2     18
                    Punjabis                  1    1    1    2      5
                    Rajputs                   2    6   10    1     19
                    Sikhs (M & R)             8   12    8    6     34
                    Sikhs (Other)            16    4    6    3     29
                    total                    62   41   51   21    175
Punjab              Ahirs                    25   28   34    8     95
                    Balmikis                  1    3    6    3     13
                    Dogras                   39   63   52   27    181
                    Gurkhas                   -    1    -    -      1
                    Gujjars                  18   16   28    4     66
                    Jats                     30   16   24    5     75
                    Punjabis                 30   27   34   11    102
                    Rajputs                  11   14   16    2     43
                    Sikhs (M & R)            58   40   59   11    168
                    Sikhs (Other)            79   54   76   17    226
                    total                   291  262  329   88    970
Rajasthan           Ahirs                     8    5    7    2     22
                    Balmikis                  -    -    1    -      1
                    Bengalees                 1    -    -    -      1
                    Gujjars                  23   19   28    3     73
                    Hindus (U.P.)             -    1    -    -      1
                    Jats                     11    8   12    1     32
                    Rajputs                  67   21   32    8    128
                    total                   110   54   80   14    258
Travancore-Cochin   Bengalees                 -    -    1    -      1
                    Christians (Syrian)      31   20   19    1     71
                    Christians (Tamil)        -    1    1    -      2
                    Malayalees               53   33   29    6    121
                    Tamilians                 2    2    -    -      4
                    total                    86   56   50    7    199
U.P.                Ahirs                    16   12    9    6     43
                    Balmikis                  2    2    9    1     14
                    Bengalees                 -    2    -    -      2
                    Garhwalis                40   75   57   24    196
                    Gurkhas                   1    1    2    1      5
                    Gujjars                  17   10   19    4     50
                    Hindus (U.P.)            17   18   23   19     77
                    Jats                      -    2    2    1      5
                    Kumaonis                 50   64   57   29    200
                    Muslims (U.P.)           15    9   19    4     47
                    Punjabis                  1    2    2    1      6
                    Rajputs                  41   20   39   10    110
                    Sikhs (Other)             -    -    2    -      2
                    total                   200  217  240  100    757
West Bengal         Andhras                   -    -    1    -      1
                    Bengalees                49   47   48   10    154
                    Gurkhas                   1    1    -    1      3
                    total                    50   48   49   11    158
Nepal               Gurkhas                  71   61   46   12    190
grand total                                1480 1242 1406  415   4543


6. COMPARISON WITH PREVIOUSLY PUBLISHED DATA 


Mourant (1954) has quoted figures of blood group frequencies supplied by 
House and Mahalanobis (1953) for a number of groups based on data collected from 
the Indian Army during the 1939-45 War. Some of these groups are comparable 
with corresponding groups in the present paper. Relevant figures are given in Table 8. 
Majumdar and Bahadur (1952) have listed a large number of Indian groups for
which blood group data have been published. The groups, Jats and Rajputs, quoted
by these authors appear to be comparable with corresponding groups in the present
paper. These figures are also shown in Table 8.


TABLE 8. COMPARISON OF GENE PERCENTAGES

                                                            gene percentages
group            source of data           number tested       p       q       r
(1)              (2)                           (3)           (4)     (5)     (6)
Punjab Hindus    present paper                 124         21.10           54.67
                 House and Mahalanobis         615         18.05           56.01
Rajputs          present paper                 320         15.02   22.33   62.65
                 House and Mahalanobis         111         17.18   25.76   57.06
                 Malone and Lahiri             118         19.60           55.18
U. P. Hindus     present paper                  79         26.94   30.42   42.51
                 House and Mahalanobis         838         19.12   23.94   56.94
U. P. Muslims    present paper                  48         14.62   29.32   56.06
                 House and Mahalanobis         109         17.98   26.16   55.87
Jats             present paper                 138         15.77   22.50   61.73
                 Malone and Lahiri             277         17.28   24.14   58.58


The agreement between comparable gene percentages seems to be tolerably
good except perhaps in the U.P. Hindus. The number of U.P. Hindus in the present
sample is small and, moreover, there are a large number of castes in U.P. all of which
may not have been represented in our sample. Majumdar and Rao (1958) give blood
group data for a number of Bengal groups. The gene frequencies given by these
authors in Table 9, p. 321 of their paper are consistent with the frequencies given
in our Tables 5 and 6 for West Bengal and Bengalees.


I must thank Dr. C. R. Rao for many helpful suggestions in the preparation 
of this paper. 
THE NATIONAL SAMPLE SURVEY
NUMBER 11 


REPORT ON 
THE SAMPLE SURVEY OF MANUFACTURING INDUSTRIES 
1949 AND 1950 


FOREWORD 


0.1 The National Income Committee*, which was set up by the Government 
of India in 1949, found that the coverage of the Indian Census of Manufacturing 
Industries (which had been initiated annually from 1946 under the Industrial 
Statistics Act of 1942) was incomplete in many respects. It did not cover Part B 
and Part C States ; and excluded 34 out of 63 groups of industries into which all 
factories were divided for the purposes of the Census. Further, although by that 
time a larger number of establishments should have been considered as factories 
according to the 1948 Factories Act, the Census had been working on the basis of 
the older definition of factories according to the 1934 Act. The difference was 
indeed large. Under the 1948 Act there were about 28,000 factories in the country 
in 1949 but according to the older definition there were only about 17,000 factories ; 
and the Census was covering between 6,500 and 7,000 factories only. 

0.2 The National Income Committee felt that it was very important for its
work to have fairly reliable estimates of the contribution of the manufacturing
industries to the national income as early as possible. On the recommendation of
this Committee, the Government of India agreed to a quick survey on a sample
basis being carried out by the Directorate of Industrial Statistics with the technical
collaboration of the Indian Statistical Institute. For immediate needs, it was decided
to have a sample survey of factories, as defined under the 1934 Act, in the first
instance. The size of the sample was 1742 ; and information on a brief schedule was
collected directly by investigators who visited the sample factories. The survey
started in January 1951 and was completed in June of the same year.


0.3 The survey was arranged in two instalments, and preliminary estimates
based on the first instalment of the data, processed by the Indian Statistical Institute,
were furnished to the Committee by April 1951, within four months from the
commencement of the survey. The final estimates on the full material were made available
within six months after the completion of the survey.


* consisting of Professor P. C. Mahalanobis (Chairman); Professor D. R. Gadgil and Dr. V. K. R. V.
Rao (Members), and Dr. R. C. Desai and later Sri Mani Mohan Mukherjee (Secretary).
0.4 The field schedule, which has been reproduced at the end of the report,
was simple and was designed to supply information on the number of persons
engaged, wages and salaries paid, and the net value added which are of basic
importance for studying the trend of industrial activities and the growth of national
income.


0.5 This survey had demonstrated the feasibility of using the sampling
method on a ‘voluntary’ basis by the method of interview by investigators and its
capacity to supply useful results quickly and at a low cost. Since then a Sample
Survey of Manufacturing Industries with coverage of factories, in accordance with
the 1948 Act, is being carried out every year, and it is intended to publish the
results regularly in future.


29 August 1958                                                  P. C. MAHALANOBIS
CHAPTER FIVE
Estimates of value of some selected items relating to manufacturing industries of India in 1949 and 1950
A few selected items relating to manufacturing industries in 1949 and 1950
Estimates of selected items for some industry groups in 1949 and 1950
Estimates of fixed capital and output per worker for some industry groups in 1949 and 1950
Total number of factories and the number of sample factories (relating to ten major
Estimates of fixed and working capital and rent paid on fixed assets (in ten major manufacturing industries) in 1949 and 1950
Production account of 61 manufacturing industries in 1949 and 1950
Estimates of cost items
Increase or decrease in costs of materials etc. in 1950 over 1949
Estimates of output for all industries in 1949 and 1950
Percentage increase or decrease in the value of output in 1950 over 1949, ten major industries
Working capital as percentage of invested capital in 1949 and 1950, ten major industries
Gross and net ratios of output to invested capital in 1949 and 1950
Number of persons employed, per capita cost of employment and gross output per
Capital and output per employed person in 1949 and 1950


This report on the Sample Survey of Manufacturing Industries, 1949 and 1950
was prepared by the Indian Statistical Institute and is being published in the form in
which it was submitted to the Government of India. The views contained in the report
are not necessarily those of the Government of India.*


CHAPTER ONE 
INTRODUCTION 


1.1. This report presents results of the Survey of Indian Manufactures 
undertaken for the first time on a sampling basis and covers the calendar years 1949 
and 1950. 


1.2. The Survey was conducted to collect certain statistics for the use of
the National Income Committee and make them available within a very short period.


1.3. For the purpose of the Census of Manufacturing Industries (CMI),
the industries have been divided into 63 groups. The census of manufactures covers 
only 29 out of these 63 groups. This report, however, gives estimates in respect of 
all the groups of Indian industries, except Railway workshops, repair shops and 
locomotive shops (CMI-58), and arms, ammunition and explosives (CMI-59) which 
were excluded from the present survey. The estimates in the report relate to the 
details of number of sample factories covered, fixed and working capital, employ- 
ment, wages and salaries, materials consumed, products manufactured and value 
added by manufacture. 


1.4. The work of planning the survey began in December 1950. As the
National Income Committee wanted estimates by April 1951 for their preliminary
report it was decided to divide the samples roughly into two equal parts. The field
work in respect of the first part started by the middle of January 1951 and was
completed by the third week of March. Preliminary estimates of the contribution of
manufacturing industries to national income were furnished to the National Income
Committee by April 1951.

*The draft report (Number D. 10) was submitted to the Government of India in November 1956.

1.5. After the survey of the first part was over, the field work in the second
part was taken up by the same set of investigators. The establishments included
in this part were covered by the middle of June 1951. Final estimates were made
available to the National Income Committee by the end of the year and all the
particulars based on both the parts taken together were analysed and the tables
completed by February 1952.
CHAPTER TWO


COVERAGE OF THE SURVEY 


2.1. The National Income Committee set up by the Ministry of Finance, 
Government of India, wanted statistics relating to manufacturing industries for 
estimating the contribution from large scale industries to national income. The 
figures for two calendar years, namely, 1949 and 1950 were wanted and a view was
expressed that it would be convenient if some provisional figures could be made
available by April 1951.
2.2. Although the Directorate of Industrial Statistics, Ministry of Commerce
and Industry, Government of India, was conducting annual census of manufacturing
industries, the lag between the completion of analysis and the years to which the
data related was about 2 to 3 years. Therefore, when these figures for 1949 and
1950 were wanted by the National Income Committee, the years for which the CMI
figures were available went up to only 1948. The chance of obtaining figures relating
to 1949 or 1950 on the basis of complete census by April 1951 was very remote indeed.
The figures given in the censuses of manufactures were wanting in another respect
also. The censuses were covering only 29 out of 63 groups of industries located in
Part A States, some of the important Part B States and a few Part C States. For
national income purposes, larger coverage, both in respect of industries and in respect
of geographical area, was naturally considered desirable.


2.3. Accordingly, a special inquiry on a random sampling basis to cover
all the 63 industries in all States was planned and arrangements were made to obtain
the analysed results quickly. The Government of India, at the instance of the
Chairman of the National Income Committee, sanctioned a scheme for this sample
Survey as an experiment. The Director of Industrial Statistics was made responsible
for the organisation of the survey.


2.4. The questionnaire included the following groups of items and altogether
there were 37 different items for each of the years in respect of each establishment :

(i) value of fixed capital which included land and building, plant and
machinery and other fixed assets;
(ii) value of working capital which included stocks of fuel and raw materials,
stocks of products and by-products and partly finished products and
cash in hand and at banks;
(iii) rent of fixed assets secured on lease;
(iv) duration of working period;
(v) labour employed with various breakdowns, and wages and salaries
paid to them;
(vi) value and quantity of input which included value of fuels, electricity,
raw materials, chemicals and work done by other concerns; and
(vii) value and quantity of output which included the value of products and
by-products, and work done by the factory for customers.


2.5. The definitions used for the items were the same as those used for
the Census of Manufactures. 


2.6. As in the case of the Census of Manufactures this survey was limited to
manufacturing establishments employing 20 or more workers and using power. But the
scope of the survey was extended to all States of the Indian Union and to all factories
which come under Section 2(j) of the Factories Act, 1934 except two Government-run
industries*, i.e., CMI-58 and CMI-59. The aggregate of all such manufacturing
establishments was 17,377 exclusive of the two industries mentioned above, according
to the lists available with the Chief Inspectors of Factories of the different States.


* Some data in respect of these two industries are available from the Railway Board and the Chief
Statistical Officer, Army Headquarters respectively.
CHAPTER THREE
SAMPLING DESIGN AND ORGANISATION OF WORK 


3.1. The frame for sampling consisted of a classified list of factories in
India. Every manufacturing establishment in India employing 10 or more workers
with power and 20 or more workers without power is required to be registered under
the Indian Factories Act 1948. These establishments are registered with the Chief
Inspectors of Factories of different States. The frame for the present survey
was, however, restricted only to establishments employing 20 or more persons using
power because no list was easily available which included the smaller establishments.
While collecting the lists from the Chief Inspectors of different States the names
and addresses of occupiers and the number of workers employed in establishments
were also collected.

3.2. For convenience, a few of the 61 industries actually surveyed were
further sub-divided and the total number, taking account of the sub-divisions, came
to 69. Within each of these 69 industries, the establishments were classified into
a number of groups according to the number of workers employed. For a number
of industries which showed marked concentration in particular areas, establishments
falling under any size-class were further grouped according to States. Thus, there
were altogether 589 strata into which the establishments were classified.


3.3. Sample establishments were selected at random with equal probability
from each of these strata and the total of samples was 1,885. The samples were
allocated to the different strata in proportion to the total number of workers employed.
Although the overall sampling fraction was approximately 1 in 9, the fraction between
the different strata varied considerably. In eight industries, because the number
of establishments was very small, all units were included in the sample. These
eight industries are shown in Table (3.1).


TABLE (3.1): LIST OF INDUSTRIES WITH THE NUMBER OF ESTABLISHMENTS IN
EACH OF THEM


industry                                              c.m.i. number   number of
                                                                      establishments
(1)                                                   (2)             (3)
1. sugar: gur and jaggery refineries                  5(b)            7
2. aluminium, copper and brass : primary producers    22(a)           3
3. iron and steel : primary producers                 23(a)           i
4. sewing machines                                    25              6
5. producer gas plants                                26              3
6. electric lamps                                     27              10
7. turpentine and resin                               37              2
8. petroleum refining *                               39              H


* return not received
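
The allocation rule of paragraph 3.3 (sample numbers proportional to workers employed, equal-probability selection within each stratum, and complete enumeration where a stratum is very small) can be illustrated by the following minimal sketch. The stratum names, worker totals and the largest-remainder rounding are assumptions made for illustration only; the report does not state how fractional allocations were rounded.

```python
import random


def allocate_proportional(workers_by_stratum, total_sample):
    """Allocate a total sample to strata in proportion to workers employed.

    Rounding uses the largest-remainder rule (an assumption; the report
    does not describe its rounding).
    """
    total_workers = sum(workers_by_stratum.values())
    quotas = {s: total_sample * w / total_workers
              for s, w in workers_by_stratum.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    shortfall = total_sample - sum(alloc.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:shortfall]:
        alloc[s] += 1
    return alloc


def draw_sample(units_by_stratum, alloc, seed=0):
    """Select units with equal probability, without replacement, in each stratum.

    If the allocation exceeds the stratum size, every unit is taken.
    """
    rng = random.Random(seed)
    return {s: rng.sample(units, min(alloc[s], len(units)))
            for s, units in units_by_stratum.items()}


# Hypothetical strata: workers employed and lists of establishment ids.
workers = {"A": 5000, "B": 3000, "C": 2000}
units = {"A": list(range(20)), "B": list(range(4)), "C": [0, 1]}
alloc = allocate_proportional(workers, 10)
sample = draw_sample(units, alloc)
```

Because allocation follows workers rather than the number of establishments, strata of large factories receive a higher sampling fraction, which is consistent with the remark that the fraction varied considerably between strata.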
3.4. The total sample was then divided roughly into two equal parts. The
establishments under each part were scattered as widely as possible but the two parts 
were in effect not comparable sub-samples. After the survey of the first part was 
over, the second part was taken up by the same set of investigators. The first part 
of the sample was utilised to obtain preliminary estimates of certain major items 
quickly as the National Income Committee wanted the estimates by April 1951. 


FIELD ORGANISATION 


3.5. The staff employed for the survey under the Director of Industrial
Statistics consisted of (i) an Officer on Special Duty, (ii) six Regional Research
Officers and (iii) thirty-two Investigators. India was divided into seven regions for
the field work. The work in Assam region was placed under the Statistics Authority,
Assam. The remaining six regions were under the six Regional Research Officers.
Each region was further divided into a number of investigator areas.


3.6. The field work began in the middle of January 1951. The establish- 
ments included in the first part of the sample were surveyed by the third week of 
March. The survey of the establishments of the second part was then taken up and 
was completed by the middle of June 1951. 


SCRUTINY AND ANALYSIS OF DATA 


3.7. It was arranged that the Indian Statistical Institute would analyse 
the data and that before sending the completed schedules to the Institute, the Office 
of the Director of Industrial Statistics would scrutinise the returns. 


3.8. The schedules completed by the Investigators were forwarded to the
Head Office at Simla after scrutiny by the Regional Research Officers. These were
further scrutinised by the Officer on Special Duty and the Director of Industrial
Statistics and, where necessary, the returns were referred back to the Regional
Research Officers for correction of errors or omissions noticed. The returns were
then sent to the Indian Statistical Institute for analysis.


3.9. Although it was arranged at first that the work of analysis should start 
straightaway without further scrutiny, some checking was, however, found necessary 
in the tabulation stage when some minor defects such as disagreement between the 
components and the sub-totals, ambiguous entries etc., were discovered. 


3.10. In addition, the scrutiny of the tabulated results also constituted
an important part of the analysis work. As the tabulation consisted of building
up roughly 45,000 estimates, some broad and suitable criteria for checking individual
estimates had to be adopted. The estimates were, therefore, studied in the light
of a number of criteria some of which were (i) ratio of fixed to working capital;
(ii) per-hour earning of a worker in different establishments of the same industry
and (iii) ratios between the excess of value of output over input and labour charges
on the one hand and the total capital employed on the other.
3.11. For both the years 1949 and 1950 the ratios were worked out for each
industry. From the central tendency and scatter of these ratios the doubtful cases
were noticed easily and both the computation sheets as well as the completed schedules
were scrutinised again.
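
A screening of this kind can be sketched as follows. The report says only that doubtful cases were picked out from the central tendency and scatter of the ratios; the use of the median and the median absolute deviation, the cut-off `k`, and the establishment labels are all assumptions made here for illustration.

```python
from statistics import median


def flag_doubtful(ratios, k=3.0):
    """Return the establishments whose ratio lies far from the industry median.

    `ratios` maps an establishment label to one check ratio (for example
    fixed capital divided by working capital). A case is flagged when its
    distance from the median exceeds k times the median absolute deviation.
    """
    centre = median(ratios.values())
    scatter = median(abs(r - centre) for r in ratios.values())
    if scatter == 0:
        # Degenerate case: any value different from the median stands out.
        return [name for name, r in ratios.items() if r != centre]
    return [name for name, r in ratios.items() if abs(r - centre) > k * scatter]


# Hypothetical fixed-to-working capital ratios within one industry.
ratios = {"e1": 1.0, "e2": 1.1, "e3": 0.9, "e4": 1.2, "e5": 9.0}
doubtful = flag_doubtful(ratios)
```

Flagged cases would then be sent back to the computation sheets and schedules for a second scrutiny, as paragraph 3.11 describes.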


3.12. Although 1746 establishments were surveyed, the total schedules
for analysis stood at 1742 after scrutiny and rejections. For each establishment 
the number of items of information collected was 37 for each year, that is 74 in total 
for the two years. The estimates in respect of these 74 characters were obtained 
for each of the 589 strata. Besides, summary estimates in respect of the industries 
and of the different States were made for each of the 74 characters. 


3.13. As already stated, the establishments in the first part of the sample
were surveyed by the third week of March and the completed schedules after scrutiny
were sent to the Indian Statistical Institute by the 31st of March 1951. The
main results based on this part were furnished to the National Income Committee
by the first week of April 1951 as required by them. The figures of both the parts
were taken up for analysis when the whole survey was over. The tabulation of all
the details was completed and the tables were passed on to the National Income
Committee by the end of February 1952.


COST


3.14. The budget estimate of the cost of this sample survey was just below
a lakh of rupees. Round figures of the actual cost under different broad headings
are given below.


Planning                    Rs.   10,000
Field work                  Rs.   80,000
Processing and analysis     Rs.   25,000
                            ------------
Total                       Rs. 1,15,000


3.15. It should be noted that in the budget estimates the cost of processing,
analysis etc., was estimated at only Rs. 3,000 and this was a clear underestimate.
The Indian Statistical Institute undertook to do the analysis in any case without
regard to the cost which amounted to Rs. 25,000 approximately. Thus, although
the budgeted figure was roughly Rs. 93,000, the actual cost was Rs. 1,15,000.



Vol. 21] SANKHYĀ: THE INDIAN JOURNAL OF STATISTICS [Parts 1 & 2
CHAPTER FOUR
RELIABILITY OF ESTIMATES 


4.1. There are scarcely any data available in published form which can 
be used to test the reliability of the results of the survey in an exact way. For any 
proper comparison the coverage of the figures must be the same. Results of 1949 
and 1950 Census of Manufactures have since been published by the Directorate of 
Industrial Statistics. The census was restricted to 29 groups of industries. The number 
of factories covered was 6758 and 7099 in 1949 and 1950 respectively. The coverage 
of the sample survey was 7928 factories in both the years, so far as these 29 groups 
of industries were concerned. Because of this reason of wider coverage of the sample 
survey, even the estimates for 29 groups of industries are not strictly comparable 
with the census results. But this factor should make the census results lower than 
the sample estimates. In Table (4.1) comparative figures are indicated in respect 
of eight important items of information for the 29 groups of industries covered by 
the census. 


TABLE (4.1): COMPARISON BETWEEN SAMPLE SURVEY AND CENSUS RESULTS:
1949 AND 1950


                                                  1949                           1950
item                        unit        sample  census  per cent       sample  census  per cent
                                        survey          difference     survey          difference
(1)                         (2)         (3)     (4)     (5)            (6)     (7)     (8)
 1. fixed capital           Rs. crore    316     228    38.6            346     258    34.1
 2. working capital         Rs. crore    386     282    37.0            375     356     5.0
 3. total invested capital  Rs. crore    702     510    37.6            721     614    17.4
 4. emoluments of labour    Rs. crore    197     177    11.1            184     172     7.1
 5. value of input          Rs. crore    803     687    15.7            833     726    13.4
 6. value of output         Rs. crore   1130     976    15.8           1164    1098    13.2
 7. 'workers' per day       lakh          17      15    13.3             16      15     6.7
 8. man-hours               crore       1122     969    15.8           1156    1023    13.0
 9. factories covered       number      7928    6250    26.8           7928    6605    20.0
10. sample size             number      1013       -       -           1013       -       -


4.2. It will be seen from the above table that in all cases the sample estimates
are greater than the census results and that the gap between the results is smaller in
1950 than in 1949. The factory coverage of the census in 1950 was much larger than
in 1949, but still smaller than the survey. The divergence thus appears to decrease
with the decrease in the gap between factory coverage of the census and the sample
survey. The survey results are reasonably higher than the census results. There
is, however, rather a large discrepancy regarding fixed capital. It is a difficult field
for collection of information, whichever be the method of collection. No definite
observation on the merits of either of the results is possible, unless a factory-to-factory
comparison, at least in the case of the sample factories, is made. But such details
of the census data are not available with the NSS.
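
The "per cent difference" columns of Table (4.1) appear to express the excess of the sample estimate over the census figure as a percentage of the census figure. The sketch below checks this reading on three rows; the function name is ours, and because the published sample and census figures are rounded, some other rows reproduce only approximately.

```python
def per_cent_excess(sample, census):
    """Excess of the sample estimate over the census figure, per cent of census."""
    return 100 * (sample - census) / census


# Rows taken from Table (4.1) for 1949: (sample survey, census).
fixed_capital_1949 = per_cent_excess(316, 228)
total_capital_1949 = per_cent_excess(702, 510)
factories_1949 = per_cent_excess(7928, 6250)
factories_1950 = per_cent_excess(7928, 6605)

print(round(fixed_capital_1949, 1), round(factories_1949, 1))
```

The factories row makes the point of paragraph 4.2 concrete: the survey frame held about 27 per cent more factories than the census in 1949 and about 20 per cent more in 1950, so sample estimates above the census figures are to be expected.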


4.3. As mentioned earlier, the field inquiry was done in two parts. In
the first part 798 sample establishments from 61 industries including their sub-groups
were covered and in the second part 944 establishments were covered from 68 industries
including the sub-groups. Because the samples in the two parts were not always
scattered over common strata or even industries, the two half samples were not
strictly comparable sub-samples. Hence, the estimated results got from the two
parts of the sample are not comparable either. Since, however, owing to the time
programme of analysis, the results of the two parts are available separately, a
comparison between the two sets of figures may be of some interest. The figures
are given in Table (4.2).


TABLE (4.2): ESTIMATES OF SELECTED ITEMS AS OBTAINED FROM THE TWO PARTS OF
THE SAMPLE


                                    sample estimates : Rs. (crore)                    per cent
                                                                                      difference
item                        first part         second part        combined
                            1949     1950      1949     1950      1949     1950      1949    1950
(1)                         (2)      (3)       (4)      (5)       (6)      (7)       (8)     (9)
1. fixed capital             530.88   566.76    544.02   596.28    538.00   582.76   -2.44   -5.07
2. working capital           577.91   587.77    487.90   459.19    530.05   518.20   16.98   34.81
3. amount received
   by labour                 267.78   263.91    261.91   244.63    264.60   253.36    2.22    7.61
4. value of input           1149.59  1105.12   1007.90  1072.63   1072.82  1128.69   12.16   10.85
5. value of output          1777.21  1815.58   1398.25  1495.60   1571.39  1642.18   24.12   19.49

   sample size                  798      798       944      944      1742     1742


4.4. It will be seen that in 3 of the 5 cases the agreement is not good. But
it is not possible to analyse the reason of such disagreements because the two
parts are not strictly comparable.
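
The report does not define the "per cent difference" of Table (4.2). The figures for fixed capital, amount received by labour and value of output are reproduced if the difference between the first-part and second-part estimates is taken as a percentage of the combined estimate; the sketch below works on that inferred assumption.

```python
def per_cent_difference(first, second, combined):
    """Difference between the two part-estimates, per cent of the combined estimate.

    The choice of the combined estimate as base is inferred from the
    published figures, not stated in the report.
    """
    return 100 * (first - second) / combined


# (first part, second part, combined) in Rs. crore, from Table (4.2).
fixed_1949 = per_cent_difference(530.88, 544.02, 538.00)
labour_1950 = per_cent_difference(263.91, 244.63, 253.36)
output_1949 = per_cent_difference(1777.21, 1398.25, 1571.39)
output_1950 = per_cent_difference(1815.58, 1495.60, 1642.18)
```

On this reading the two parts differ by about a fifth to a quarter of the combined figure for value of output, which is the kind of disagreement paragraph 4.4 refers to.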


CONCLUDING REMARKS 


4.5. The extent to which the respondents gave correct information is not
known. The method of collection of data was that the investigator would visit
the establishments or the owners and complete the schedules there. As the complete
addresses of the sample units were available, there were no difficulties in locating
the establishments. The main difficulties in obtaining data were, in the first place,
that the owners did not maintain statistical records of production and employment
in the way these were wanted in the questionnaire and secondly, that in some cases
the owners were unwilling to furnish these particulars to a Government agency
because they were suspicious of the motives of the survey. Where the records were
incomplete the investigators obtained the estimates from the owners of the
establishments. In regard to the second type of difficulty the investigators had to explain
the purpose of the survey to the owners and to impress on them that these returns
were not meant for income-tax purposes. Of the 1,885 samples selected 1,746 were
actually surveyed. The proportion not surveyed was 7.4%.
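
The proportion not surveyed follows directly from the counts given in the text. A minimal check, using only the figures quoted in paragraphs 3.12 and 4.5 (the function name is ours):

```python
def per_cent_not_surveyed(selected, surveyed):
    """Share of selected sample units that were not surveyed, in per cent."""
    return 100 * (selected - surveyed) / selected


rate = per_cent_not_surveyed(1885, 1746)   # 139 units out of 1,885
rejected_after_scrutiny = 1746 - 1742      # schedules dropped before analysis

print(round(rate, 1))  # prints 7.4
```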


4.6. It may be noted that this problem of non-response and deliberate 
furnishing of inaccurate data is not a problem limited to sample surveys only but 
also common to censuses. The actual method of collecting data in the Censuses of 
Manufactures is by mailing questionnaires with the system of the field staff of the 
State Statistics Authorities assisting the occupiers of factories in filling the returns 
completely and accurately where necessary. The method followed in the sample 
survey was that the investigator had to visit the selected establishments in every 
case with a view to minimising non-response. 


4.7. In addition, as the sample size was only a fraction of the total
establishments in the country, the number of schedules completed during the survey was
comparatively small. These completed schedules could, therefore, be scrutinised by
the Regional Research Officers and then by the Officer on Special Duty and
ultimately, in some cases, by the Director of Industrial Statistics. Wherever it was
found necessary, the schedules were referred back to the investigators so that correct
data might be obtained after clearing up the inconsistencies with the owners of the
establishments. Thus it is reasonable to say that the data obtained are reliable.


NATIONAL SAMPLE SURVEY : MANUFACTURING INDUSTRIES, 1949 AND 1950 
CHAPTER FIVE
SUMMARY RESULTS 


5.1. The tabulated results give the estimates for 37 items for the years 
1949 and 1950 arranged by industries. A summary of some of the results is given 
in this chapter. It may be mentioned in passing that the figures relate entirely to 
manufacturing industries and hence exclude other branches of productive activity 
such as trade, transport, commerce, mining etc. They also exclude particulars of 
small scale manufacturing establishments, not covered by Section 2(j) of the
Factories Act of 1934. 


5.2. The All-India estimates of the value of fixed and working capital,
rent paid by establishments, amount received by labour, value of raw materials,
value of input, value of output and the difference between the values of input and
output are given in Table (5.1). The value of fixed capital, that is, the value
of land and buildings, plant and machinery and other fixed assets amounted to Rs. 538
crore in 1949 and Rs. 583 crore in 1950. The values were based on the original costs
of the fixed capital plus the cost of improvements made less the amount written
off as discarded. The rents paid for using fixed capital on lease amounted to less
than 1 per cent of the value of fixed capital owned by the establishments. The
totals of fixed and working capital employed by the manufacturing industries were
Rs. 1068 crore in 1949 and Rs. 1101 crore in 1950. The amounts received by
labour including workers and other employees amounted to Rs. 265 crore in 1949
and Rs. 253 crore in 1950. The values of raw materials used were Rs. 1014
crore and Rs. 1067 crore in 1949 and 1950 respectively. The difference between
the values of output and input, that is, the value added by manufacture gross of
depreciation amounted to Rs. 499 crore in 1949 and Rs. 513 crore in 1950.


TABLE (5.1): ESTIMATES OF VALUE OF SOME SELECTED ITEMS
RELATING TO MANUFACTURING INDUSTRIES OF
INDIA IN 1949 AND 1950


                                        Rs. (crore)
item                                 1949        1950
(1)                                  (2)         (3)
1. fixed capital                     538.00      582.76
2. working capital                   530.05      518.20
3. rent                                3.55        3.60
4. amount received by labour         264.60      253.36
5. value of raw materials           1014.23     1067.32
6. value of input                   1072.82     1128.69
7. value of output                  1571.39     1642.18
8. difference (7-6)                  498.57      513.49
9. sample size                         1742        1742


TABLE (5.2): A FEW SELECTED ITEMS RELATING TO MANUFACTURING
INDUSTRIES IN 1949 AND 1950


item                                              1949       1950
                                                  (crore)    (crore)
(1)                                               (2)        (3)
1. number of working days                           0.34       0.35
2. total number of workers employed per day         0.24       0.23
3. total number of persons other than
   workers employed per day                         0.03       0.03
4. total labour employed per day                    0.27       0.26
5. man-hours worked by workers                    519.66     493.36
6. electricity consumed (kwh)                     198.20     202.73
7. sample size                                      1742       1742


5.3. The total number of working days of all manufacturing establishments
was estimated at 0.34 crore for 1949 and 0.35 crore for 1950. The total number of
workers employed per day was estimated at 0.24 crore in 1949 and 0.23 crore in 1950.
When the employees other than the workers are taken into consideration the total
of labour employed amounted to 0.27 crore in 1949 and 0.26 crore in 1950. The
total quantity of electricity in kwh consumed by the manufacturing establishments
was estimated to be 198 crore and 203 crore in 1949 and 1950 respectively.


PARTICULARS BY INDUSTRY-GROUPS 


5.4. The particulars for six most important groups of industries in India 
judging from their value of output are given below as a matter of interest. The 
figures for the two years are shown separately in Table (5.3). 


5.5. It will be seen that the six industry groups in order of their importance
are (1) manufacture of textiles, (2) manufacture of food and beverage, (3) manufacture
of chemicals and chemical products, (4) manufacture of basic metals,
(5) ginning, pressing, decorticating and similar services to agricultural products, and
(6) manufacture of machinery excluding electrical machinery and appliances. The
lighter industries have thus much predominance in the pattern of our manufacturing
activities.
TABLE (5.3): ESTIMATES OF SELECTED ITEMS FOR SOME INDUSTRY GROUPS IN
1949 AND 1950


                                            number of  fixed     working   number of   value of  value of
industry group                              sample     capital   capital   workers     input     output
                                            units      Rs.       Rs.       (thousand)  Rs.       Rs.
                                                       (crore)   (crore)               (crore)   (crore)
(1)                                         (2)        (3)       (4)       (5)         (6)       (7)

                                                                  1949
1. basic metals                                51       51.89     42.93        93        47.86     82.48
2. chemicals and chemical products            135       50.93     48.37      1,17       183.95    211.08
3. food and beverage                          243       87.52     75.47      2,65       190.26    290.46
4. ginning, pressing and similar services
   to agricultural products                   160       26.77      7.43      1,38        61.31     64.91
5. machinery excluding electrical
   machinery                                  147       31.11     28.98      1,48        29.28     57.55
6. textile                                    521      123.37    205.81     11,71       388.11    577.20
7. total (1 to 6)                           1,257      371.59    408.99     19,32       900.77   1283.68
8. total of all industries                  1,742      538.00    530.05     24,24      1072.82   1571.39

                                                                  1950
1. basic metals                                51       56.10     43.60        91        55.64     90.14
2. chemicals and chemical products            135       52.31     50.37      1,06       181.79    215.44
3. food and beverage                          243       96.40     70.30      2,71       201.63    317.43
4. ginning, pressing and similar services
   to agricultural products                   160       26.55      5.88      1,41        61.53     66.47
5. machinery excluding electrical
   machinery                                  147       35.43     28.85      1,55        32.41     62.76
6. textile                                    521      134.44    200.80     10,85       398.92    570.87
7. total (1 to 6)                           1,257      401.23    399.80     18,49       931.92   1323.11
8. total of all industries                  1,742      582.76    518.20     23,37      1128.69   1642.18


5.6. Table (5.4) shows the value of fixed capital per employed worker
and the value of output per employed worker in the six groups of industries.
The figures for all the industries grouped together are also given. The value of
fixed capital per worker was highest in the basic metal industries. Next in
order of ranking the groups are: chemical and chemical products industries, food
and beverage industries, machine manufacturing industries, ginning, pressing and
similar industries, and textile industries. It may be noted that this ratio for all
industries was higher than the ratio for the first six groups of industries taken together.


5.7. The value of output per worker was, however, highest in the chemical
and chemical products industries. In order of ranking, the other industries are
food and beverage, basic metals, textile, ginning, pressing and similar servicing, and
lastly manufacture of machinery. The productivity of workers of all industries
taken together was roughly of the same order as that of the first six industry groups.


TABLE (5.4): ESTIMATES OF FIXED CAPITAL AND OUTPUT PER WORKER FOR SOME
INDUSTRY GROUPS IN 1949 AND 1950


                                                      estimates per worker (Rs.)
industry group                                     fixed capital          output
                                                   1949     1950       1949     1950
(1)                                                (2)      (3)        (4)      (5)
1. basic metals                                    5,580    6,165      8,869    9,905
2. chemicals and chemical products                 4,353    4,935     18,041   20,325
3. ginning, pressing and similar services to
   agricultural products                           1,940    1,883      4,704    4,714
4. food and beverage                               3,303    3,557     10,961   11,713
5. machinery excluding electrical machinery        2,102    2,286      3,889    4,049
6. textile                                         1,054    1,239      4,929    5,261
7. total (1 to 6)                                  1,923    2,170      6,644    7,156
8. total of all industries                         2,219    2,494      6,483    7,027
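
The per-worker figures of Table (5.4) follow from Table (5.3) by a change of units: one crore is 10,000,000 rupees and workers are given in thousands. The sketch below reproduces a few 1949 entries; the function name is ours, and the worker figure printed as 24,24 is read here as 2,424 thousand.

```python
RUPEES_PER_CRORE = 10_000_000


def rupees_per_worker(value_crore, workers_thousand):
    """Convert a value in Rs. crore and workers in thousands to Rs. per worker."""
    return value_crore * RUPEES_PER_CRORE / (workers_thousand * 1000)


# 1949 figures from Table (5.3): (fixed capital, output, workers in thousands).
basic_metals = (51.89, 82.48, 93)
all_industries = (538.00, 1571.39, 2424)

fixed_bm = round(rupees_per_worker(basic_metals[0], basic_metals[2]))
output_bm = round(rupees_per_worker(basic_metals[1], basic_metals[2]))
fixed_all = round(rupees_per_worker(all_industries[0], all_industries[2]))
output_all = round(rupees_per_worker(all_industries[1], all_industries[2]))
```

The same conversion applied to the other rows of Table (5.3) gives the remaining entries, so Table (5.4) carries no information beyond Table (5.3); it only makes the ranking of the groups easier to read.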


5.8. Between the two years, the value of fixed capital
per worker increased to some extent from 1949 to 1950 in all the groups except in
ginning, pressing and similar industries. The value of output per worker also increased
to a varying extent from 1949 to 1950. Without going into the detailed tables of
individual industries it is difficult to indicate how much of this increase was
due to price variation and how much due to increase in quantity.


PARTICULARS BY TEN MAJOR INDUSTRIES 


5.9. The sample survey of manufacturing industries, as stated earlier, 
covered 61 industries. The total number of establishments for which estimates 
have been made was 17,377. Out of these, factories belonging to the ten major 
manufacturing industries number 3050. Their distribution along with the samples 
taken in each case is as in Table (5.5). 


TABLE (5.5): TOTAL NUMBER OF FACTORIES AND THE NUMBER OF
SAMPLE FACTORIES (RELATING TO TEN MAJOR MANUFACTURING
INDUSTRIES) IN 1949 AND 1950


                                    total        number
industry                            number of    of sample
                                    factories    factories
(1)                                 (2)          (3)
 1. cement                              17           11
 2. heavy chemicals                    210           27
 3. cotton textile                     755          302
 4. iron and steel                     203           22
 5. jute textile                       104          100
 6. paper and paper board               52           15
 7. paints and varnishes                48            8
 8. sugar                              431           89
 9. tea                               1080           78
10. tobacco                            150           19
11. total (1 to 10)                   3050          671
12. total of all industries         17,377         1742


5.10. The ten major industries for selective review are cotton textile, jute,
iron and steel, tea, sugar, chemicals, paper and paper board, tobacco, cement, and
paints and varnishes. The number of factories covered by these ten industries was
about 18 per cent of the total number of factories in all the industries, but they accounted
for 55.14 and 54.68 per cent of the total invested capital in all industries in 1949 and
1950 respectively. Table (5.6) sets out the position of these major industries of
India with regard to their size as measured by capital outlay and the extent to
which they own fixed assets in comparison with the position of all industries.


5.11. It will be seen that cotton textile, iron and steel, and jute by themselves
account for about 34 per cent of the total capital outlay in all industries. An
observation of the figures of fixed capital for the ten industries in Table (5.6)
brings out an upward trend in the fixed capital investment in these industries in 1950
over 1949. This is a sign of development of these industries. The figures
of working capital for these major industries, however, indicate in 1950 a falling
tendency as compared to those of 1949, the only notable exception being the tea industry
where working capital rose in 1950 by 19 per cent over that in the previous year.
In so far as the rented fixed assets are concerned the year 1950 appears to have been
marked with an effort on the part of these industries to reduce rent payments on
fixed assets (from Rs. 58 lakh in 1949 to Rs. 51 lakh in 1950) by owning more
fixed assets. The amount of rent paid on fixed assets for all industries recorded a
slight increase from Rs. 3.55 crore in 1949 to Rs. 3.60 crore in 1950. Exceptions in
this regard are paper and paper board, tea and cement industries, even though the
first two of these otherwise effected an increase in their working capital.
TABLE (5.6): ESTIMATES OF FIXED AND WORKING CAPITAL AND RENT PAID ON FIXED
ASSETS (IN TEN MAJOR MANUFACTURING INDUSTRIES) IN 1949 AND 1950


                                                          Rs. (crore)
                                                                  total       rent paid
industry                                    fixed      working    invested    on fixed
                                            capital    capital    capital     assets
(1)                                         (2)        (3)        (4)         (5)

                                                            1949
 1. cement                                   10.74       5.91       16.65      0.03
 2. chemicals                                18.14      12.18       30.32      0.08
 3. cotton textiles                          79.53     150.82      230.35      0.11
 4. iron and steel                           43.67      24.20       67.87      0.02
 5. jute textiles                            26.74      38.93       65.67      0.09
 6. paints and varnishes                      0.72       1.98        2.70      0.01
 7. paper and paper board                    15.08       6.39       21.47      0.00
 8. sugar                                    20.66      37.55       58.21      0.11
 9. tea                                      50.80      23.02       73.82      0.09
10. tobacco                                   6.40      15.44       21.84      0.04
11. total (1 to 10)                         272.48     316.42      588.90      0.58
12. percentage to total of all industries    50.65      59.70       55.14     16.34
13. total of all industries                 538.00     530.05     1068.05      3.55

                                                            1950
 1. cement                                   10.70       5.53       16.23      0.04
 2. chemicals                                17.91      12.17       30.08      0.08
 3. cotton textiles                          88.30     146.96      235.26      0.10
 4. iron and steel                           47.75      23.34       71.09      0.02
 5. jute textiles                            28.62      37.53       66.15      0.05
 6. paints and varnishes                      0.55       1.61        2.16      0.01
 7. paper and paper board                    17.93       6.24       24.17      0.00
 8. sugar                                    26.69      30.59       57.28      0.09
 9. tea                                      53.00      27.40       80.40      0.09
10. tobacco                                   6.83      12.39       19.22      0.03
11. total (1 to 10)                         298.28     303.76      602.04      0.51
12. percentage to total of all industries    51.18      58.62       54.68     14.17
13. total of all industries                 582.76     518.20     1100.96      3.60
5.12. The gross income of 61 industries was Rs. 1571 crore in the year
1949 and Rs. 1642 crore in the year 1950. They are distributed in Table (5.7).


TABLE (5.7): PRODUCTION ACCOUNT OF 61 MANUFACTURING INDUSTRIES
IN 1949 AND 1950


                                                                    Rs. (crore)
item                                                             1949       1950
(1)                                                              (2)        (3)
A. Value of production
    1. products and by-products                                 1529.48    1601.95
    2. work done for other concerns                               41.91      40.23
    3. total (1 + 2)                                            1571.39    1642.18
B. Value of input
    4. raw materials and chemicals etc.                         1014.21    1067.16
    5. fuels, lubricants etc.                                     51.31      53.63
    6. work done by other concerns                                 7.30       7.90
    7. total (4 to 6)                                           1072.82    1128.69
C. Depreciation estimated @ 7%                                    37.66      40.79
D. Value added by manufacture (net of depreciation)
    8. salaries, wages and other benefits received by labour     264.60     253.36
    9. balance available for other purposes                      196.31     219.34
   10. total (8 + 9)                                             460.91     472.70
5.13. The figures of depreciation on fixed assets were not collected separately
in this survey. However, the Income Tax Manual, Part II, 1954, in accordance with
Section 10(2)vi of the Income Tax Act, 1922, prescribes a general rate of depreciation
on fixed assets at 7 per cent of the value of such assets. Worked on this basis
the amount of depreciation on fixed assets comes to Rs. 37.66 crore in 1949 and
Rs. 40.79 crore in 1950. Thus, the value added by manufacture, net of depreciation,
comes to Rs. 460.91 crore in 1949 of which Rs. 264.60 crore or about 57.4 per cent
went to wage and salary earners. In the following year (1950), the value
added, net of depreciation, comes to Rs. 472.70 crore of which Rs. 253.36 crore, or
about 53.6 per cent went to wage and salary earners.
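
The arithmetic of paragraph 5.13 can be retraced from the figures of Table (5.7). The sketch below applies the 7 per cent rate to fixed capital and derives the net value added, the share of labour and the balance; the function and key names are ours, and intermediate results are rounded to two decimals as in the published table.

```python
def production_account(output, inputs, fixed_capital, labour, rate=0.07):
    """Net value added and its division, all money values in Rs. crore."""
    depreciation = round(rate * fixed_capital, 2)
    value_added = round(output - inputs - depreciation, 2)
    return {
        "depreciation": depreciation,
        "value_added": value_added,
        "labour_share_pct": round(100 * labour / value_added, 1),
        "balance": round(value_added - labour, 2),
    }


# Value of output, value of input, fixed capital and labour receipts.
acct_1949 = production_account(1571.39, 1072.82, 538.00, 264.60)
acct_1950 = production_account(1642.18, 1128.69, 582.76, 253.36)
```

Since depreciation is a flat 7 per cent of fixed capital rather than a recorded figure, the net value added moves with the fixed capital estimate as well as with output and input.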


5.14. The comparative values in the ten major industries as against those
in all the industries, in respect of the components of the gross income and gross
expenditure are examined later in detail under the separate section devoted to these
items. The comparative gross income in the ten major industries was Rs. 860 crore
in 1949, and Rs. 883 crore in 1950, forming respectively 54.7 and 53.8 per cent of the
gross income of all industries.
5.15. Cost of materials: The questionnaire called for data on the quantity
and purchase value of each material consumed during the year. Only materials
which were purchased have been included. Materials made in the factory have not
been included. The purchase value of the quantity of material purchased during the
year has been taken as equal to the cost of material landed at the factory, i.e., any
expenses incurred in transporting the materials to the factory have been added to
the payment made to the seller of the material unless transport was carried out by
the factory's own staff. If any duty was paid by the factory, it has also been
added to the amount paid to the seller. Particulars relating to goods which were
not subjected to any manufacturing process but were merely bought and re-sold
in the same condition as received have been excluded. The total amount paid to
other firms or factories for work done on materials given out to them, plus
transport and any other charges incurred on these goods, has been included.


5.16. Fuel and electric energy used: The quantities of the several kinds of
fuel (coal, coke, fuel oil and gas) used, the quantity of electric energy purchased
and the quantity of water used by manufacturing establishments have been reported
together with the cost of each. Fuel, electricity etc., produced in the factory
have not been included. If any electricity generated (or coal gas produced) is sold to
any person or transferred to allied concerns, such electricity (or coal gas) has been
regarded as a product.


5.17. The All-India figures of costs in all industries are shown in Table (5.8).
It would appear that the total costs in all industries recorded an increase of Rs. 56
crore or 5.2 per cent in 1950 over the previous year. Of the total cost the cost of raw
materials formed on an average 94.5 per cent.


TABLE (5.8): ESTIMATES OF COST ITEMS

                                                   Rs. (crore)
items of cost                                  1949        1950
(1)                                             (2)         (3)
1. raw materials                            1014.21     1067.16
2. fuels etc.                                 51.31       53.63
3. work done for the factory by others         7.30        7.90
4. total                                    1072.82     1128.69


5.18. The increase or decrease in the costs of materials etc., for each of the
ten major industries is shown in Table (5.9).


TABLE (5.9): INCREASE OR DECREASE IN COSTS OF MATERIALS ETC.
IN 1950 OVER 1949

                                                                      Rs. (lakh)
                                     raw       fuel,   work done
                               materials  lubricants     for the
industry                             and         and  factory by       total
                               chemicals electricity       other       input
                                                        concerns
(1)                                  (2)         (3)         (4)         (5)
1. cement                       + 141.28    +  22.79    +   0.44    + 164.51
2. chemicals                    + 187.19    +  24.29    -   0.25    + 211.23
3. cotton textile               +1345.53    +  26.71    -  25.19    +1347.05
4. iron and steel               +  89.67    +  29.23    -   1.64    + 117.27
5. jute textile                 - 775.13    -  21.91    +   4.03    - 793.02
6. paints and varnishes         -  42.51    -   0.10        -       -  42.60
7. paper and paper board        +  98.18    +  39.23    -   1.34    + 136.07
8. sugar                        +  98.90    -   9.46    +   0.87    +  90.31
9. tea                          + 311.38    +  24.47    +   7.88    + 343.73
10. tobacco                     + 386.19    +   0.01    -   1.39    + 384.81
11. total                       +1840.68    + 135.26    -  16.59    +1959.36


5.19. For the ten major industries these costs as compared to those for
all the industries were slightly lower at 49.27 per cent in 1950 compared to 50.01
per cent in 1949. Industry-wise, the raw materials and chemicals consumed were
on the increase in 1950 compared to the previous year, paints and varnishes and jute
textiles being exceptions.

5.20. The fuel cost of the ten major industries in 1949 and 1950 as a percentage
of the corresponding figure for all the industries, however, remained more or less
stationary at 60 per cent. Here again in sugar, paints and varnishes, and jute textile
industries, the fuel cost in the latter year was lower. The other seven industries
exhibited an increase in their fuel costs.

5.21. The cost of work done by other concerns for the ten industries compared
to the corresponding item for all industries declined from 42 per cent in 1949 to 37
per cent in 1950. In cotton textile industry, in particular, where the work done for
the industry by other concerns generally covers quite a few processes, the cost on
this account was lower by Rs. 25.19 lakh in the latter year, indicating thereby the
industry’s capacity to complete many such processes by itself. Paper and paper
board, iron and steel, and chemicals were other industries in order of precedence which
effected economy in 1950 over the year 1949 in respect of payments for work done
for them by other concerns. In the jute and tea industries, on the other hand, the
cost of work done for them by other concerns showed an increase in the latter year.


Vol. 21] SANKHYĀ : THE INDIAN JOURNAL OF STATISTICS [Parts 1 & 2


PRODUCTS AND BY-PRODUCTS 


5.22. The value of products and by-products in all industries was Rs. 1529.48 
and Rs. 1601.95 crore in 1949 and 1950 respectively which meant an increase in 
the latter year by 4.7 per cent. On the other hand, the value of work done for others 
in all industries was 4 per cent lower in 1950 than that in the previous year. This 
is shown in Table (5.10). 


TABLE (5.10) : ESTIMATES OF OUTPUT FOR ALL INDUSTRIES IN 1949 AND 1950 


Rs. (crore) 


item                                      1949        1950
(1)                                        (2)         (3)
1. products and by-products            1529.48     1601.95
2. work done for others                  41.91       40.23
3. total                               1571.39     1642.18


5.23. As a percentage of all industries’ total, the value of products and
by-products for the ten major industries in the year 1950 registered a nominal decline
over the year 1949 although in absolute terms these items for the ten major industries
showed an increase of 2.7 per cent in 1950 over the previous year. The following
table presents the percentage increase or decrease in the value of products and
by-products as well as that in the value of work done by the factory for others in 1950
over the year 1949.


TABLE (5.11): PERCENTAGE INCREASE OR DECREASE IN THE
VALUE OF OUTPUT IN 1950 OVER 1949, TEN MAJOR
INDUSTRIES

industry                      products and     work done
                               by-products    for others
(1)                                    (2)           (3)
1. cement                           +37.14        +23.45
2. chemicals                        +18.53        +30.20
3. cotton textile                   - 4.00        + 8.84
4. iron and steel                   -   ..        +20.83
5. jute textile                     + 2.41        +28.51
6. paints and varnishes             -22.54           -
7. paper and paper board            +18.23           -
8. sugar                            + 6.41        -45.22
9. tea                              +16.25        -59.73
10. tobacco                         +11.65        -18.31


5.24. It is evident from the above that the value of products and by-products
increased in the year 1950, in order of precedence, in the cement industry,
chemicals, paper and paper board, tea, tobacco, sugar and jute textiles while it was
lower in respect of paints and varnishes, cotton textiles and iron and steel industries.


5.25. The work done by factory for customers is a source of revenue to the 
industry. The value of such work done in the ten major industries as a percentage 
of this work for all industries recorded an increase of 0.3 per cent in 1950 over 1949. 
For these ten industries themselves, the year 1950 recorded a rise in value in this sphere 
of the order of 9.2 per cent over that in the year 1949. The percentage increase in 
value in the year 1950 compared to the previous year was 30.20 in chemicals, 28.51 in 
jute, 23.45 in cement, 20.83 in iron and steel and 8.84 in cotton; the percentage 
decrease in value was of the order of 59.73 in tea, 45.22 in sugar, 18.31 in tobacco. 


VALUE ADDED BY MANUFACTURE 


5.26. In our survey the concept of ‘net value of production’ has not been
used. We have used a similar concept, ‘value added by manufacture’. This
measures the increase in the total value of commodities by the manufacturing process
and is calculated by subtracting the cost of materials, supplies, containers, fuel,
purchased electric energy, contract work, and the depreciation of fixed assets from
the total value of products and work done by the industry for customers.


5.27. The figure thus calculated is somewhat larger than the net value of
production because many miscellaneous expenses, such as commission on sales,
insurance and advertising, have not been taken into account. Therefore, it must
not be inferred that when wages and salaries and undistributed profits are deducted
from these values added by manufacture the whole of the residue is available for
non-wage factor payments. The relevant figures are given in Table (5.12).


TABLE (5.12): VALUE ADDED BY MANUFACTURE IN 1949 AND 1950,
ALL INDUSTRIES

                                                    Rs. (crore)
item                                            1949        1950
(1)                                              (2)         (3)
1. value of input                            1072.82     1128.69
2. depreciation estimated @7%                  37.66       40.79
3. value of output                           1571.39     1642.18
4. value added (net of depreciation)          460.91      472.70
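
The definition given in paragraph 5.26 can be checked against the all-industry figures of Table (5.12). What follows is a minimal sketch in Python; the function name `value_added` and the dictionary layout are illustrative assumptions and not part of the survey.

```python
# Value added by manufacture (net of depreciation), in Rs. crore.
# Input figures are the all-industry estimates of Table (5.12);
# the function name and data layout are illustrative assumptions.

def value_added(output: float, input_cost: float, depreciation: float) -> float:
    """Value of output less cost of inputs less depreciation of fixed assets."""
    return round(output - input_cost - depreciation, 2)

all_industries = {
    1949: {"output": 1571.39, "input_cost": 1072.82, "depreciation": 37.66},
    1950: {"output": 1642.18, "input_cost": 1128.69, "depreciation": 40.79},
}

net = {year: value_added(**row) for year, row in all_industries.items()}

# Percentage change of 1950 over 1949.
growth_pct = round(100 * (net[1950] - net[1949]) / net[1949], 2)

print(net)
print(growth_pct)
```

The two results reproduce row 4 of the table, and the percentage change corresponds to the increase quoted for all industries in the text.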


5.28. The value added by manufacture in all industries stood at Rs. 460.91
and Rs. 472.70 crore in the year 1949 and 1950 respectively, marking an increase
of 2.56 per cent in the latter year. The corresponding figures for ten major
industries are given in Table (5.13).


TABLE (5.13): VALUE ADDED BY MANUFACTURE IN 1949
AND 1950, TEN MAJOR INDUSTRIES

                                                    Rs. (crore)
item                                            1949        1950
(1)                                              (2)         (3)
1. value of input                             536.56      556.15
2. depreciation estimated @7%                  19.07       20.88
3. value of output                            860.15      883.33
4. value added (net of depreciation)          304.52      306.30


5.29. The table above discloses that the value added by manufacture in the
ten major industries, as a proportion of that for all the industries, recorded a decline
of 1.3 percentage points in 1950 over the previous year. The total value added
by manufacture for these ten industries alone, however, showed a rise of 0.58 per
cent in 1950 over the year 1949.


5.30. It would be of interest to have an idea of the value added per worker
in all industries as well as in the ten major industries in 1949 and 1950. The value
added per worker for all industries was Rs. 1901 in 1949 and Rs. 2023 in 1950 as
against Rs. 2108 and Rs. 2247 respectively for the ten industries.
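
The per-worker figures above can be approximately recovered by dividing the value added (Rs. crore) by the number of workers (thousand) reported in Table (5.20). The sketch below assumes that this is how the quoted figures were obtained; the helper name `per_worker` is hypothetical.

```python
# Value added per worker in rupees.
# Value added is taken from Tables (5.12) and (5.13), numbers of workers
# from Table (5.20). Treating the quoted figures as this simple quotient
# is an assumption made for illustration.

CRORE = 10_000_000   # 1 crore = 10^7
THOUSAND = 1_000

def per_worker(value_added_crore: float, workers_thousand: float) -> float:
    return value_added_crore * CRORE / (workers_thousand * THOUSAND)

all_1949 = per_worker(460.91, 2424.00)
all_1950 = per_worker(472.70, 2337.00)
ten_1949 = per_worker(304.52, 1445.28)
ten_1950 = per_worker(306.30, 1362.85)

for label, value in [("all industries 1949", all_1949),
                     ("all industries 1950", all_1950),
                     ("ten industries 1949", ten_1949),
                     ("ten industries 1950", ten_1950)]:
    print(f"{label}: Rs. {value:.0f}")
```

The quotients agree with the printed figures to within about a rupee; the small residual for the ten industries in 1949 presumably reflects rounding in the published totals.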


INVESTMENT 


5.31. Capital structure: All particulars under this head are as they were on
the 31st December 1949 or 1950 or the date on which the factory last closed
accounts. ‘Value’ in all the headings specified under the item ‘productive capital
employed’ should be taken to mean value according to the books of the factory on
the date to which the particulars furnished under this item relate. For items of
fixed capital, these are the original cost plus the cost of improvements less
amount written off. In case a factory occupies only a portion of any building or
piece of land, particulars relating to only that portion had been included. In the
case of any item of fixed capital which had been leased or rented, the rent had
been shown separately. In calculating this rent, where any lump sum consideration
was originally paid for securing the items of fixed capital in question either on lease
or on rent, the present book value of the amount originally paid had been included
in the amount of the rent.


5.32. The invested capital in all industries stood at Rs. 1068.05 and
Rs. 1100.96 crore in 1949 and 1950 respectively showing a rise of about 3 per cent
in the latter year. Compared to this, the fixed capital in all industries was Rs. 538.00
and Rs. 582.76 crore during 1949 and 1950, recording a rise of 8.32 per cent in the
latter year. The comparative figures for working capital were Rs. 530.05 and
Rs. 518.20 crore during 1949 and 1950 respectively showing a decline of 2.24 per
cent in the latter year. This is brought out in Table (5.14).


TABLE (5.14): ESTIMATES OF FIXED AND WORKING CAPITAL
FOR ALL INDUSTRIES IN 1949 AND 1950

                                          Rs. (crore)
item                                  1949        1950
(1)                                    (2)         (3)
1. fixed capital                    538.00      582.76
2. working capital                  530.05      518.20
3. total                           1068.05     1100.96


5.33. The total capital investment in the ten major industries comprised, 
as already observed, 55.14 and 54.68 per cent of the capital invested in all the 
industries taken together during the years 1949 and 1950 respectively. On the other 
hand, the fixed capital invested in these ten major industries comprised respectively 
50.65 and 51.18 per cent during 1949 and 1950 of the total fixed capital outlay in 
all the industries. The working capital invested in these industries formed 59.70 
and 58.62 per cent during the year 1949 and 1950 respectively of the total working
capital in all the industries. While it would be needless to repeat here the table
giving figures of fixed, working and invested capital, for the ten major industries, 
it would be no doubt instructive to observe the relationship between working capital 
and total capital invested in Table (5.15). 


TABLE (5.15): WORKING CAPITAL AS PERCENTAGE OF
INVESTED CAPITAL IN 1949 AND 1950, TEN
MAJOR INDUSTRIES

industry                              1949        1950
(1)                                    (2)         (3)
1. cement                            35.52       34.07
2. chemicals                         40.16       40.44
3. cotton textile                    65.48       62.47
4. iron and steel                    35.65       32.83
5. jute textile                      59.29       56.73
6. paints and varnishes              73.44       74.72
7. paper and paper board             29.74       25.82
8. sugar                             64.51       53.40
9. tea                               31.19       29.04
10. tobacco                          70.70       64.47
11. all industries                   49.63       47.07


5.34. The year 1950 is marked by a general tendency in all these industries
for the working capital to be lower as compared to the previous year. During the year
1949, the proportion of the working capital to invested capital was the highest at
73.44 per cent in the paints and varnishes industry followed by 70.70 per cent in
tobacco, 65.48 per cent in cotton textile, 64.51 per cent in sugar and 59.29 per cent
in jute textile industry.


5.35. The output as a percentage of the invested capital in all industries was
of the order of 147 in 1949 and 149 in 1950. As against this the value added for all
industries was broadly 43 per cent of the invested capital for both the years.
Table (5.16) shows the output and value added as percentages of invested capital.


TABLE (5.16): GROSS AND NET RATIOS OF OUTPUT TO INVESTED CAPITAL
IN 1949 AND 1950

                                       percentage of invested capital
industry                              output             value added*
                                  1949      1950       1949      1950
(1)                                (2)       (3)        (4)       (5)
1. cement                        74.28    104.50      28.78     47.71
2. chemicals                     75.57     90.33      34.98     42.42
3. cotton textile               165.56    155.63      56.86     43.23
4. iron and steel                85.51     80.78      37.71     33.10
5. jute textile                 220.32    224.00      50.71     67.40
6. paints and varnishes         166.83    161.60      58.55     46.45
7. paper and paper board         63.75     66.97      20.62     23.91
8. sugar                        163.37    176.62      50.32     59.42
9. tea                          124.91    133.32      73.49     81.58
10. tobacco                     161.69    205.08      44.61     51.90
11. total (1 to 10)             146.06    146.72      51.71     50.88
12. total of all industries     147.14    149.12      43.15     42.93
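
The all-industry ratios in the last row can be recomputed from the aggregates reported elsewhere in this section: invested capital from Table (5.14), output and value added from Table (5.12). The sketch below is illustrative only; the helper name `pct_of_invested` is an assumption.

```python
# Output and value added as percentages of invested capital, all industries.
# Amounts are in Rs. crore, taken from Tables (5.12) and (5.14).

figures = {
    1949: {"invested": 1068.05, "output": 1571.39, "value_added": 460.91},
    1950: {"invested": 1100.96, "output": 1642.18, "value_added": 472.70},
}

def pct_of_invested(amount: float, invested: float) -> float:
    """Express an amount as a percentage of invested capital."""
    return round(100 * amount / invested, 2)

ratios = {
    year: {
        "output": pct_of_invested(row["output"], row["invested"]),
        "value_added": pct_of_invested(row["value_added"], row["invested"]),
    }
    for year, row in figures.items()
}

for year, row in ratios.items():
    print(year, row)
```

The value-added ratio comes out close to 43 per cent in both years, which is consistent with the statement in paragraph 5.35.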


5.36. The value added as a percentage of the invested capital is markedly
higher in the ten major industries. Among these industries the highest output
expressed as a percentage of invested capital is recorded in order of precedence in jute,
tobacco, and sugar industries whereas the value added as percentage of the invested
capital is the highest in tea, jute and sugar industries.


* Value added in Table (5.16) is net of depreciation.


LABOUR AND THEIR EARNINGS


5.37. According to the Census of 1951 the total population of India was
3613 lakh for the year 1951. The Census Report observes that there is a recurring
net annual increase in our population of 1.3 per cent or 44 lakh. Applying this
annual rate of increase to the population figure for 1951, we arrive at a population
figure of 3525 lakh for the year 1949, the corresponding figure being 3569 lakh for
the year 1950. Table (5.17) gives the figures of total population, aggregate
self-supporting working population, and the self-supporting working population
in industries.
TABLE (5.17): TOTAL AND WORKING POPULATION COMPARED TO THE
WORKING POPULATION IN INDUSTRIES*

                                                                    (lakh)
                                      working population
year          total       total   industries   manufacturing    ten major
         population                               industries   industries
(1)             (2)         (3)          (4)             (5)          (6)
1949          3,525       1,009           89              27           16
1950          3,569       1,021           91              26           15


5.38. The 1951 Census enumerated the total working population in the
country at 1044 lakh of persons or 28.62 per cent of the total population. Out of
this, 334 lakh persons or 9.24 per cent were reported to be engaged in non-agricultural
occupations. The workers engaged in industries of all types and sizes (i.e., in processing
and manufacturing) including such establishments as are covered by the Factories
Act were, however, returned at 92 lakh in 1951 or 2.54 per cent of the total population.
Applying the percentage of working population to total population in 1951
to the total population figures for 1949 and 1950, we arrive at the figures of total
working population for these years as shown in the table above. Similarly, the
workers engaged in industries during the years 1949 and 1950 have been arrived
at by applying the percentage of workers engaged in industries in 1951, i.e., 2.54
per cent to the total population figures of the years 1949 and 1950.
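
The back-projection described in paragraphs 5.37 and 5.38 can be sketched as follows. The constants are the figures quoted in the text; the function name `back_project` and the use of a constant annual increment of 44 lakh (rather than a compounding 1.3 per cent) are assumptions made to match the printed results.

```python
# Population for 1949 and 1950 projected back from the 1951 Census,
# and the working population implied by the 1951 share (all in lakh).

CENSUS_1951 = 3613        # total population, 1951
ANNUAL_INCREASE = 44      # net annual increase quoted in the text
WORKING_SHARE = 28.62     # working population, per cent of total, 1951

def back_project(year: int) -> int:
    """Subtract the constant annual increment for each year before 1951."""
    return CENSUS_1951 - ANNUAL_INCREASE * (1951 - year)

population = {year: back_project(year) for year in (1949, 1950)}
working = {year: round(pop * WORKING_SHARE / 100)
           for year, pop in population.items()}

print(population)
print(working)
```

Run as written, this gives 3525 and 3569 lakh for total population and 1009 and 1021 lakh for the working population, the figures used in the text.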


5.39. Taking the proportion of self-supporting working population to total
population and that of the self-supporting working population in industries to the
total population as obtained in the Census figures for 1951, the working population
works out to be 1009 lakh in 1949 and 1021 lakh in 1950. The working population
in industries, on the same basis of calculation, comes to 89 lakh in 1949 and 91 lakh
in 1950. These figures are broadly comparable to the figures of persons employed
in all industries, as well as in the ten major industries, as estimated in our survey.
The figures of total working population in the 61 organised industries as given by our
survey were 27 lakh or 30 per cent in 1949 and 26 lakh or 29 per cent in 1950. The
working population in the ten major industries was 16 lakh in 1949 and 16 lakh
in 1950. When compared to the working population in organised industries alone,
the figures of persons employed in the ten major industries comprised 58 per cent
and 57 per cent in 1949 and 1950 respectively. Coming to individual industries we
find that iron and steel, cotton textile and jute between them employed 45 per cent
of the total workers in all the industries.


* Only self-supporting persons are included in working population.


5.40. The duration of work in organised industries is governed by the
Factories Act, 1948, both in regard to perennial factories which work all the year
round and the seasonal factories which work during a particular season, like the sugar
industry which works during the period sugar-cane is available. The total working
days in all industries showed an increase in the year 1950 of the order of 1.8 per cent
over the year 1949, the actual figures being 34,34,000 in 1949 as against 34,96,000
in 1950.


5.41. Out of the total working days for all industries, those in the ten major
industries formed broadly 20 per cent during both the years. Total man-hours
worked for all the industries stood at 519.66 crore and 493.36 crore in 1949 and
1950 respectively. Out of these, the man-hours worked in the ten major industries
accounted for 62 and 59 per cent in 1949 and 1950 respectively. The maximum
man-hours worked during the two years were in cotton textile, jute and iron and
steel industries in the descending order.


5.42. In this survey each manufacturer was asked to report the average
number of employees per day receiving pay within the calendar year. The employees
were classified into two broad groups, namely, (i) wage earners and (ii) others.
‘Employees’ include all administrative, technical and clerical staff working within
the factory area and all those engaged in effecting delivery of the output. But
persons employed in any retail sales organisation maintained by the factory and
those engaged in sale of goods which were not subjected to any manufacturing process
but merely bought and re-sold have been excluded. The personnel of the central
administrative offices outside the factory area have also not been included.

5.43. Wage earners and wages: Wage earners in manufacturing plants
are, generally speaking, those who perform manual work using tools, operating
machines, handling materials and products and caring for plant and its equipment.
They comprise both time-workers and piece-workers. ‘Worker’ does not include
a person solely employed in a clerical capacity in any room or place where no
manufacturing process is carried on.

5.44. The average number of persons employed per day has been worked
out by dividing the aggregate number in attendance on all working days by the total
number of working days during the year. In reckoning attendances, attendances by
... aggregate of daily attendances in respect of all working days. Absence for a few
hours only ... Total attendances on any day are the ... that day. Days on which the
factory ...

5.45. The total amount of wages paid to workers in all industries inclusive
of other benefits stood at Rs. 213.40 crore in 1949 and Rs. 199.64 crore in 1950
thus recording a decrease of approximately 6.0 per cent in the latter year. This
is shown in Table (5.18).


TABLE (5.18): ESTIMATES OF WAGES AND SALARIES PAID

                                                                       Rs. (crore)
item                                                                1949       1950
(1)                                                                  (2)        (3)
1. wages (inclusive of benefits for workers)                      213.40     199.64
2. salaries (inclusive of benefits for persons other than
   workers)                                                        51.20      53.72


5.46. The total salaries paid to persons other than workers in all industries
recorded an increase of 5 per cent in 1950 over the previous year.

5.47. The wages bill inclusive of other benefits in respect of all the workers 
per working day in the ten major industries was of the order of 65.5 per cent of the 
total wages including benefits for all industries during the years 1949 and 1950. The 
highest absolute wage inclusive of other benefits was recorded in cotton textile, followed 
by that in jute, iron and steel, sugar and tea, the first three alone accounting for 
about 86 per cent of the wages bill for the ten major industries, and for about 57
per cent of the wages bill for workers for all industries. The rates of remuneration
for workers and for persons other than workers are set out in Table (5.19).


TABLE (5.19): ESTIMATES OF WAGES AND SALARIES FOR WORKERS AND PERSONS
OTHER THAN WORKERS IN 1949 AND 1950

                                      workers                     other than workers        average earnings
                          wages including   average earning  wages including  average earnings  per employed
industry                     benefits         per worker        benefits         per person      person per
                            Rs. (crore)      per day (Rs.)     Rs. (crore)     per day (Rs.)   working day (Rs.)
                            1949    1950    1949    1950    1949    1950    1949    1950    1949    1950
(1)                          (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)    (10)    (11)
1. cement                   1.59    1.57    2.60    2.76    0.36    0.44    5.18    6.62    2.87    3.17
2. cotton textile          89.18   76.06    4.15    4.26   10.30   10.09    ?.22    8.92    4.87    4.59
3. heavy chemicals          3.10    2.99    3.90    3.55    1.40    1.91    8.00    9.57    4.63    4.71
4. iron and steel           8.69   10.29    6.56    7.30    3.34    2.97   11.18    8.65    7.40    7.56
5. jute textile            25.24   22.60    3.20    3.00    3.07    3.13    7.15    7.86    3.40    3.34
6. paints and varnishes     0.23    0.15    2.66    2.26    0.13    0.13    5.76    5.81    3.31    3.17
7. paper and paper board    1.94    1.70    2.87    2.09    0.68    1.22    7.20   10.99    3.40    3.16
8. sugar                    6.37    6.09    7.43    5.85    2.31    2.37   10.40    9.41    8.04    6.55
9. tea                      4.46    5.71    1.97    2.42    2.19    2.32    6.07    6.18    2.53    2.94
10. tobacco                 1.38    1.46    2.47    2.31    0.44    0.54    5.74    6.03    2.87    2.71
11. total (1 to 10)       142.18  128.62    4.37    4.16   24.22   25.12    8.00    8.27    4.68    4.53
12. total of all
    industries            213.40  199.64    4.45    4.25   51.20   53.72    8.91    9.22    4.92    4.80
13. per cent to total
    of all industries      66.63   64.42   98.20   97.88   47.30   46.76   89.79   89.70   95.12   94.38


5.48. The highest absolute salaries including benefits for persons other
than workers are in cotton textile, jute, iron and steel, sugar and tea. The first three
of these account for 69 per cent of the total salaries for the ten major industries and
about 33 per cent of the total salaries including benefits for all industries for employees
other than workers.


5.49. The average earnings per worker per working day in all industries 
were Rs. 4.45 in 1949 and Rs. 4.25 in 1950 whereas for persons other than workers 
the corresponding figures were Rs. 8.91 in 1949 and Rs. 9.22 in 1950. Thus in all 
industries the average earnings for workers per working day dropped in 1950 by 
about 5 per cent whereas the earnings increased by 3 per cent for persons other than
workers. The overall average earnings per working day for all employed persons
recorded, however, a nominal fall in 1950 of the order of 2 per cent.

5.50. In the ten major industries the average wages per worker per working
day were Rs. 4.37 in 1949 and Rs. 4.16 in 1950. The average earnings per worker per 
day in the ten major industries as compared to those in all industries were of the 
order of 98 per cent, the maximum average earning per worker being in iron and 
steel, sugar, cotton textiles and jute textile industries in descending order. 


5.51. The rate of earnings per working day for workers and persons other than
workers taken together in all industries was Rs. 4.92 in 1949 and Rs. 4.80 in 1950.
The comparative average earnings in the ten major industries were Rs. 4.68 in 1949
and Rs. 4.53 in 1950.

5.52. Table (5.20) shows the number of workers in relation to total
wages and value of input and output.


TABLE (5.20): NUMBER OF WORKERS IN RELATION TO TOTAL WAGES AND VALUE OF
INPUT AND OUTPUT IN 1949 AND 1950

                          number of workers   amount received       value of            value of
industry                     (thousand)         by workers           input               output
                                                Rs. (crore)        Rs. (crore)         Rs. (crore)
                             1949     1950     1949     1950      1949     1950      1949     1950
(1)                           (2)      (3)      (4)      (5)       (6)      (7)       (8)      (9)
1. cement                   18.54    16.15     1.59     1.57      6.83     8.47     12.37    16.96
2. chemicals                31.09    31.08     3.10     2.99     11.04    13.15     22.92    27.17
3. cotton textiles         778.46   711.07    89.18    76.06    244.79   258.96    381.33   366.15
4. iron and steel           65.00    63.22     8.69    10.29     29.38    30.55     58.04    57.43
5. jute textile            302.47   283.44    25.24    22.60    109.51   101.58    144.68   148.17
6. paints and varnishes      3.09     2.38     0.23     0.15      2.87     2.45      4.50     3.49
7. paper and paper board    24.46    28.96     1.94     1.70      8.91     9.57     13.69    16.18
8. sugar                   100.95   102.05     6.37     6.09     64.36    65.27     95.10   101.17
9. tea                      97.14    98.61     4.46     5.71     34.45    37.88     92.20   107.19
10. tobacco                 24.08    25.89     1.38     1.46     25.12    28.97     35.32    39.42
11. total (1 to 10)       1445.28  1362.85   142.18   128.62    536.56   556.15    860.15   883.33
12. total of all
    industries            2424.00  2337.00   213.40   199.64   1072.82  1128.69   1571.39  1642.18
13. per cent to total
    of all industries       59.62    58.32    66.63    64.42     50.01    49.27     54.74    53.79


The total number of workers in all
industries comprised 2424 and 2337 thousand during 1949 and 1950 respectively.
The amount received by workers in these two years stood for all industries
at Rs. 213.40 and Rs. 199.64 crore. The total value of input was Rs. 1072.82
crore in 1949 and Rs. 1128.69 crore in 1950 whereas the respective values of output
were Rs. 1571.39 and Rs. 1642.18 crore. As a proportion of the figures for all industries
in 1949 and 1950, the number of workers in the ten major industries
formed respectively 60 and 58 per cent and the value of output 55 and 54 per cent.
Per capita labour earnings in all industries were Rs. 975 in 1949 and Rs. 964 in
1950, the corresponding figures in the ten major industries being Rs. 1053 in 1949
and Rs. 1027 in 1950. Thus in the ten major industries the daily rate of earnings was
lower but the average annual earning higher than in all industries, because
the number of working days was 226 in the former case against 200 in the latter.
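
The relation between daily and annual earnings invoked at the end of paragraph 5.52 can be checked roughly by dividing the per capita annual earnings by the average earnings per employed person per working day from Table (5.19). The sketch below assumes that annual earnings are simply the daily rate multiplied by the number of days worked; the function name `implied_working_days` is hypothetical.

```python
# Working days implied by annual and daily earnings (both in Rs.).
# Annual figures are the per capita labour earnings quoted in paragraph 5.52;
# daily figures are the average earnings per employed person per working day
# from Table (5.19). The identity annual = daily x days is an assumption.

annual = {
    "all industries": {1949: 975, 1950: 964},
    "ten major industries": {1949: 1053, 1950: 1027},
}
daily = {
    "all industries": {1949: 4.92, 1950: 4.80},
    "ten major industries": {1949: 4.68, 1950: 4.53},
}

def implied_working_days(annual_rs: float, daily_rs: float) -> float:
    return annual_rs / daily_rs

days = {
    group: {year: implied_working_days(annual[group][year], daily[group][year])
            for year in (1949, 1950)}
    for group in annual
}

for group, by_year in days.items():
    print(group, {year: round(d, 1) for year, d in by_year.items()})
```

The implied figures lie close to 200 days for all industries and close to 226 days for the ten major industries, in line with the explanation given in the text.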


EMPLOYMENT VIS-A-VIS CAPITAL INVESTMENT 


5.53. The total invested capital in all industries was Rs. 1068.05 crore
in 1949 and Rs. 1100.96 crore in 1950. The corresponding figures of fixed capital
are Rs. 538 and Rs. 583 crore respectively. As against this the total number of
persons employed in all industries stood at 2714 thousand in 1949 and 2627 thousand
in 1950. The output per employed person was Rs. 5790 and Rs. 6251 in
1949 and 1950 respectively. These figures for all industries are shown in
Table (5.21).
TABLE (5.21): NUMBER OF PERSONS EMPLOYED, PER CAPITA COST OF
EMPLOYMENT AND GROSS OUTPUT PER EMPLOYED
PERSON IN ALL INDUSTRIES IN 1949 AND 1950

item                                             unit              1949       1950
(1)                                              (2)                (3)        (4)
1. number of employed persons                    (thousand)        2714       2627
2. total invested capital                        (Rs. crore)    1068.05    1100.96
3. per capita cost of employment in terms of
   invested capital                              (Rs.)             3935       4191
4. fixed capital                                 (Rs. crore)     538.00     582.76
5. per capita cost of employment in terms of
   fixed capital outlay                          (Rs.)             1982       2218
6. gross output per employed person              (Rs.)             5790       6251
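
Rows 3, 5 and 6 of Table (5.21) are quotients of the aggregate rows and can be reproduced directly. The following sketch uses the all-industry aggregates of this section; the helper name `per_head` is an illustrative assumption.

```python
# Per capita cost of employment and output per employed person (Rs.),
# all industries, from the aggregates of Tables (5.10), (5.14) and (5.21).

CRORE = 10_000_000   # 1 crore = 10^7
THOUSAND = 1_000

data = {
    1949: {"employed": 2714, "invested": 1068.05, "fixed": 538.00, "output": 1571.39},
    1950: {"employed": 2627, "invested": 1100.96, "fixed": 582.76, "output": 1642.18},
}

def per_head(amount_crore: float, persons_thousand: float) -> int:
    """Rupees per employed person, rounded to the nearest rupee."""
    return round(amount_crore * CRORE / (persons_thousand * THOUSAND))

per_capita = {
    year: {key: per_head(row[key], row["employed"])
           for key in ("invested", "fixed", "output")}
    for year, row in data.items()
}

for year, row in per_capita.items():
    print(year, row)
```

The results match the printed per capita rows for both years.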


5.54. The invested capital per employed person in all industries works
out to Rs. 3935 in 1949 and Rs. 4191 in 1950. The fixed capital per employed person
recorded an increase of 11.9 per cent in 1950 over 1949. The output per employed
person stood at Rs. 5790 and Rs. 6251 during 1949 and 1950, which meant an increase
of 8 per cent in the latter year.


5.55. An investment in the ten major industries of the order of 55.14 and
54.68 per cent during 1949 and 1950 respectively of the total invested capital in all
the industries went towards providing employment broadly to 60 per cent of the
total employed persons in all industries. The fixed capital per employed person
formed 87 per cent in 1949 and 90 per cent in 1950 of that in all industries. The
cost of providing employment to one person in these ten industries varied on an
average from Rs. 3700 to Rs. 4000. The corresponding output per employed person
lay in the range of Rs. 5400 to Rs. 5900. In the order of capital-intensity per
employed person, the ten major industries can be arranged as follows: iron and steel,
cement, paper and paper board, tobacco, paints and varnishes, tea, sugar, cotton
textile and jute. The highest fixed capital investment among the ten industries
was in iron and steel, followed by cement, paper and paper board, chemicals, tea,
tobacco, sugar, paints and varnishes, cotton and jute. This is shown in Table (5.22).


TABLE (5.22): CAPITAL AND OUTPUT PER EMPLOYED PERSON IN 1949 AND 1950

                               invested capital     fixed capital per       output per
industry                      per employed person    employed person     employed person
                                     Rs.                   Rs.                 Rs.
                                 1949      1950       1949      1950      1949      1950
(1)                               (2)       (3)        (4)       (5)       (6)       (7)
1. cement                        8057      8990       5197      5027      5986      9394
2. chemicals                     8000      7824       4786      4659      6047      7068
3. cotton textiles               2796      3112        965      1168      4628      4844
4. iron and steel                8521      9042       5483      6073      7286      7304
5. jute textile                  2059      2213        838       958      4536      4947
6. paints and varnishes          6920      6754       1845      1720     11532     10913
7. paper and paper board         7701      7343       5409      5447      4910      4919
8. sugar                         4579      4519       1625      2106      7481      7982
9. tea                             ..      7034       4511      4637      8186      9378
10. tobacco                      7976      6508       2337      2313     12899     13352
11. total (1 to 10)              3727      4023       1725      1993      5444      5902
12. total of all industries      3935      4191       1982      2218      5790      6251


5.56. Thus the invested capital in the ten major industries per employed
person recorded an increase of 8 per cent in 1950 over 1949, while the fixed capital was
marked by an increment of the order of 16 per cent in the latter year. The output
per employed person on the other hand witnessed a rise by about 8.4 per cent in
1950 over the previous year.


CAPITAL FORMATION 


5.57. The fixed capital in all industries was Rs. 538 crore in 1949 and
Rs. 583 crore in 1950. This meant an increase in the fixed capital outlay in all
industries taken together of the order of Rs. 45 crore or 8.36 per cent in course of
one year. For the ten major industries on the whole the fixed capital increased in
1950 over 1949 by Rs. 26 crore or by 9.5 per cent. Individually for each of these
ten industries there was a general increase in the figures of fixed capital in 1950 over
1949.


PRICE FACTOR 


5.58. All that has been said above regarding increase in the value of production
would mean little real gain as the recorded higher attainments in 1950
may be due to the higher prices in the latter year. The price factor has to be taken
into account for final appraisal. It would be well to remember that the index number
of wholesale prices was 308 for 1947. After the partial decontrol experiment in that
year, the index number of wholesale prices came to be stabilised around 380 in 1948.
During the year 1949 this index stood at 385. Thereafter, towards the beginning of
1950, a downward trend in prices was noticeable as a result of a general readjustment
of prices all over the world. But this falling tendency received a setback owing to
the outbreak of war in Korea in June 1950. Prices went up rapidly and the wholesale
price index for the year 1950 stood at 409. Prices in 1950 were thus higher to the
tune of 6 per cent over those in 1949.


5.59. Assuming that the wholesale index number is applicable to the field 
under review, we find that for all industries taken together as also for ten major 
industries there was very little increase in the real value added per worker. Certain 
individual industries, however, such as cement and jute recorded a real increase of 
the order of 75 per cent and 35 per cent respectively. Other industries where real 
increase in the value added per worker was broadly in the neighbourhood of 12 per 
cent in the year 1950 were chemicals, tea and sugar. On the other hand, the real 
value added per worker in 1950 as compared to 1949 declined sharply in paints and 
varnishes followed by cotton textiles, iron and steel, tobacco, and paper and paper 
boards. 
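
The deflation described in paragraphs 5.58 and 5.59 can be sketched as follows. This is a minimal illustration and not the computation actually used for the report. It assumes, as the text does, that the wholesale price index (385 for 1949, 409 for 1950) applies to the field under review. It further assumes that the last pair of columns in the table (rows 11 and 12) can stand in for value added per worker; that identification is a guess made here for illustration. The function name `real_change_pct` is hypothetical.

```python
def real_change_pct(value_1949: float, value_1950: float,
                    index_1949: float = 385.0,
                    index_1950: float = 409.0) -> float:
    """Percentage change after re-expressing the 1950 value at 1949 prices."""
    deflated_1950 = value_1950 * index_1949 / index_1950
    return (deflated_1950 - value_1949) / value_1949 * 100.0


# Rise in the wholesale price index between 1949 and 1950.
price_rise = (409.0 - 385.0) / 385.0 * 100.0

# (1949, 1950) figures per employed person from rows 12 and 11 of the table,
# used here only as an assumed proxy for value added per worker.
all_industries = real_change_pct(5790, 6251)
ten_major = real_change_pct(5444, 5902)

print(f"price rise: {price_rise:.1f} per cent")
print(f"all industries, real change: {all_industries:.1f} per cent")
print(f"ten major industries, real change: {ten_major:.1f} per cent")
```

Under these assumptions the real change works out at roughly 2 per cent for both aggregates, which is in line with the statement that there was very little increase in real terms.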


5.60. The rise in prices of the order of 6 per cent in 1950 as compared to 
the price level in the year 1949 can hardly be taken to usher in anything like a boom 
in the real sense for industries. It is true that owing to the particular tempo of the 
Korean war, the products of jute, cement, chemicals and cotton textile industries 
received a larger assured export market, but the industries found it difficult to expand 
their production adequately so as to make full use of the favourable market 
conditions. 


NATIONAL SAMPLE SURVEY : MANUFACTURING INDUSTRIES, 1949 AND 1950 


INDUSTRY-WISE TABLE SHOWING THE NUMBER OF SAMPLE FACTORIES AND THE TOTAL 
NUMBER OF FACTORIES, 1949 AND 1950 


                                                      number of     total
                                                         sample    number
    industry                                          factories  of factories
    (1)                                                     (2)       (3)

 1. wheat flour                                               8       115
 2. rice milling                                             35      1574
 3. biscuit making                                           10        70
 4. fruit and vegetable processing                            6        26
 5. sugar : vacuum pan factories                             12       171
 6. sugar : gur and jaggery refineries                        7         7
 7. sugar : gur factories                                    10       253
 8. distilleries and breweries                                8        66
 9. starch                                                    6        22
10. vegetable oils : oil mills                               58      1282
11. vegetable oils : hydrogenated                             9        34
12. paints and varnishes                                      8        48
13. soap                                                     11        82
14. tanning                                                  17       115
15. cement                                                   11        17
16. glass and glassware                                      31       177
17. ceramics                                                 12        76
18. plywood and tea chests                                   11        47
19. paper and paper board                                    15        52
20. matches                                                  14        79
21. cotton textiles : spinning mills                         38        84
22. cotton textiles : composite mills                       232       312
23. cotton textiles : power-loom mills                       32       359
24. woollen textiles                                         16        67
25. jute textiles                                           100       104
26. chemicals (including drugs etc.)                         27       210
27. aluminium, copper and brass : primary products            3         3
28. aluminium and brass etc. : secondary products            26       254
29. iron and steel : primary products                         5         5
30. iron and steel : other than primary products             17       198
31. bicycles                                                  6        17
32. sewing machines                                           6         6
33. producer gas plants                                       3         3
34. electric lamps                                           10        10
35. electric fans                                             9        43
36. general and electrical engineering repair workshop        1       210
37. general and electrical engineering manufacturing        114      1730
38. footwear and leather manufacturing                        8        31
39. rubber and rubber manufacturing                          10        99
40. enamelware                                                6        26
41. hume pipes and other cement and concrete products         7        57
42. asbestos and asbestos cement products                     3         5
43. bricks, tiles, lime and surki manufacturing              23       219
44. lac                                                      15        78
45. turpentine and rosin                                      2         2
46. plastic (including gramophone records)                    8        33
47. saw milling                                               9       305
48. woodware (including furniture)                           19       149
49. tea manufacturing                                        78      1080
50. tobacco products                                         19       150
51. groundnut decorticating etc.                             26       382
52. printing and bookbinding                                 57       915
53. webbing narrow fabrics                                    8        86
54. hosiery and other knitted goods                          27       240
55. thread and thread ball making                            12        36
56. textiles, dyeing, bleaching etc.                         19       173
57. clothing and tailoring                                   11        27
58. cotton ginning and pressing                             119      2767
59. rope making                                               4         8
60. silk and artificial silk                                 33       313
61. jute pressing                                            15        39
62. electricity generation and transformation                16       225
63. automobiles and coach building                           38       421
64. ship building and ship repairing                         14        60
65. aircraft assembling and repair service                    7        18
66. railway wagon manufacturing                               3         4
67. textile machinery and accessories                        17       103
68. unspecified industries                                   96      1398
69. total of all industries                                1742     17377

APPENDIX III 
PRINCIPAL PARTICIPANTS 


This Survey was initiated in the Directorate of Industrial Statistics under 
instructions from the Honorary Statistical Adviser to the Cabinet and the Chairman 
of the National Income Committee, and the collection of the data was also made by 
the staff of that Directorate. Processing and analysis of the data were done, and the 
report (on the basis of a draft prepared by Shri Hari Charan Ghose, then Chief 
Director of National Sample Survey) was prepared, by the Indian Statistical Institute. 


THE NATIONAL SAMPLE SURVEY 


NUMBER 12 
A TECHNICAL NOTE ON AGE GROUPING 


FOREWORD 


Biases in age returns occur extensively in India as in many other countries. 
Attention to this problem has been given in the Indian Censuses; and special group- 
ings have been adopted from time to time for age tabulations and the smoothing of 
age returns. No systematic study of age distortion has, however, been made so far. 
It became necessary to consider this question in connexion with the analysis of the 
demographic data collected in the National Sample Survey (NSS). This technical 
note gives the results of special investigations undertaken by Ajit Das Gupta and 
his colleagues in the Indian Statistical Institute for a period of about three years on 
the basis of the NSS and Census age returns, special experiments, and contemporary 
field studies. 

2. The heaping of age returns has been studied in this report for the three 
components : 

(1) digit preference (as such, without the effects of estimation error and 
age bias); 

(2) estimation error (as such, without the effects of age bias); and 

(3) age bias; 

with a view to isolating the influence of each of these elements by itself (some amount 
of overlap was, however, unavoidable), and to building up the most efficient set of 
grouping from the knowledge so obtained. 


3. The conventional 0-4 : 5-9 quinary grouping [connoted in the present 
note by 0 : 5] was found to be relatively inefficient for the NSS data; and the set 2 : 7 
came out to be most efficient for important age-income segments of the population : 
this set also seemed to give a more balanced distribution of the group errors. 

4. The superiority of this set was also brought out by other special investigations 
made by D. B. Lahiri and presented in the paper “Recent developments in the 
use of techniques for assessment of errors in nation-wide surveys in India" at the 
International Statistical Conference, Stockholm, 1957. This set had been found to 
be the most efficient for age returns in the Uttar Pradesh Census of 1951; in the 1931 
Census Report also the 2 : 7 grouping had been recommended after an analysis of 
the age in individual years on traditional lines. No detailed examination could be 
made for the 1941 Census age data as the tabulations were based on the two per 
cent Y-sample. In the Census of India 1951, Paper No. 8, 1954, some detailed 
examination of the age data of Uttar Pradesh led to the 2 : 7 set being described 
as a “standard” grouping; but the 3 : 8 set was recommended as “proper” for 
reasons not clearly understood. 

5. Experience of sample surveys conducted in India suggests that with 
greater care and interviews at depth, which are not practicable in the Census, it is 
possible to make improvements in the age returns. 


6. The analysis in this report was restricted to the specific objective in view. 
Other aspects of the quality of population data as obtained through a Census or 
through the NSS have not been considered here. Studies are, however, going on; 
and the systematic under-reporting of the population in the young age group 0-14 
in the 1951 Census was, for example, examined in NSS Draft No. 14, "Some characteristics 
of the economically active population", on the basis of a comparison with 
age distributions of NSS data on population. 


11 October 1958 P. C. MAHALANOBIS 


THE NATIONAL SAMPLE SURVEY 
NUMBER 12 
A TECHNICAL NOTE ON AGE GROUPING 


This Report, A Technical Note on Age Grouping, was prepared by the Indian 
Statistical Institute and is being published in the form in which it was submitted to 
the Government of India. The views contained in it are not necessarily those of the 
Government of India.* 


SECTION ONE 


INTRODUCTORY 


1.1. The question of comparative efficiency of different sets of age group- 
ing arose in analysis of National Sample Survey (NSS) demographic data. The 
feature of heaping up at certain digits, ascribed to ‘digit preference’, and the resulting 
distortion of age returns were studied at some depth in this context. The examina- 
tion of the constituent data itself is no doubt of primary importance in deciding on 
an efficient set of age grouping, but it was felt that the precise nature of the complex 
of factors underlying the age distortions had to be understood clearly before the 
question could be properly tackled. Advantage was, therefore, taken of some ex- 
periments contemporarily organised to investigate the interplay of these factors. 


1.2. The results of the investigation and the conclusions arrived at are 
set down in the following sections. The conventional 0-4 : 5-9 grouping, symbolised 
in the present note as 0 : 5, was found relatively inefficient for the NSS medium and 
the most efficient set 2: 7 was adopted in grouping ages for the purpose of internal 
analysis and also for the purpose of presentation. The departure from the con- 
vention itself seemed to call for sufficient justification ; this note was prepared to 
provide the necessary logical foundation. The findings are of course of wider impli- 
cation. 


1.3. Summary findings: The digit preference was examined in isolation 
from other estimation errors in the Estimation and Extra Sensory Perception 
(E & ESP) Study 1954 and the examination extended to actual age data. The digit 
preference as such was seen to have little effect on age record, the estimation errors 
being by far the dominant factor in the Indian situation where ages were mostly 
recorded from guess. West Bengal Special Demography (WBSD) Study 1954 for 
example disclosed that definite evidence of age, including a definite statement of 
the date of birth and that of children, was available only for one out of six persons, 
while for about half the population the ages were recorded just from guess. 


* The draft report (Number D. 16) was submitted to the Government of India in December 1956. 


1.4. The digit preference comprised primarily a tendency to keep to 
the middle of the digit array 0, 1, ..., 8, 9 : a liking for a run of consecutive digits 
and a dislike to repeat digits, for convenience called the second order of digit preference, 
were also found. The digit preference might be significant in situations 
where no major distorting factors entered. 


1.5. Estimation errors on the other hand produced the familiar pattern 
of rounding up at digits ‘0’ and ‘5’. The analysis of the estimation error suggested 
that apart from the errors of rounding up, there could be a bias to over-estimate. 
In WBSD Study, both the ages as stated by the informant independently and as 
assessed by the investigator from the evidence available on his best efforts, were 
recorded, along with the type of evidence available, the rating of statement and the 
rating of assessment. The age-assessed was identical with the age-stated in about 
3 out of 4 cases; but for the remaining, age-assessed was higher than age-stated nearly 
twice or thrice as often, more often in the middle age range. The age-assessed series 
however did not appear to be of any better quality than the age-stated series and 
over-estimation in assessment was indicated. Due to general ignorance of age, the 
age assessed by the investigator is usually recorded and the Census age data also 
supported the finding. The bias to over-estimate the age was confirmed in the West 
Bengal Household Comparative (WBHC) Study 1955, where the ages of the common 
population of NSS 4th round and the Study were recorded after a lapse of three 
years. Significant over-estimation in recorded ages appeared in WBHC Study; 
the bias actually started as one of under-estimation in the young age range which 
changed to progressive over-estimation with increasing age, resulting in over-estimation 
in the aggregate. 

1.6. The third basic element distorting age returns, the age bias, involving 
conscious mis-statement of age, was difficult to locate from internal analysis alone, 
particularly in situations like India where estimation errors are much larger in 
dimension. 

1.7. A modified simple measure of concentration, on the lines of the Myers’ 
index of concentration, was evolved in this note, leading to an index of aggregate 
distortion; a relative range measure of deviation and a group efficiency index 
were suggested to enable better analysis and comparative study. It was interesting 
to note that the average deviation percent of age was nearly uniform in all age ranges, 
of the order of 0.5. A new technique was applied to determine the most efficient 
set of age grouping, as the group efficiency index varied for different age segments 
and socio-economic classes of the same population. 


SECTION TWO 


THE PROBLEM 


2.1. While grouping of data is often necessary for presentation and proper 
comprehension, in the field of age statistics this necessity may be utilised to evolve 
a set of grouping that reduces the total group errors to a minimum. In a country 
where ages are as a rule definitely known and reported, the question of the most 
efficient set of age grouping is not so important, as the group errors will be small 
for any set. But in a country where ages are generally not definitely known and 
the heaping up at certain digits at the cost of others is very marked, the selection 
of an efficient set of grouping is very important. 


2.2. This feature of heaping up or concentration at certain popular digits 
was usually referred to as ‘integer bias’ in the past and sought to be attributed to bias 
for certain ‘preferred’ end-digits like 0 and 5. In recent years, however, this is being 
treated more as an error of rounding off. A good deal of work on the subject of 
‘integer-bias’ or ‘round-off’ errors in age reporting has already been done, specially 
in the national census publications of different countries. Age is an important factor 
not only in the understanding of the vital flows that condition population dynamics 
but also in sizing up most other population characteristics of socio-economic interest; 
the need for getting at the best estimate of the true group-age distribution is thus 
obvious. 


2.3. A priori considerations suggest that the heaping up in age returns 
might be the combined effect of the following elements : 


(1) digit preference (as such, without the effects of estimation error and 
age bias); 


(2) estimation error (as such, without the effects of age bias); 
(3) age bias. 


An effort was, therefore, made to grasp the effect of these elements in isolation and 
to build up the most efficient set of grouping from the knowledge so obtained. The 
study of these forces in isolation was difficult and some amount of overlapping could 
not always be avoided. 


2.4. From a priori considerations again, it would appear that the effect 
of digit preference can extend over the unit cycle of end-digits 0, 1, 2, ..., 9 and 
thus only small displacement errors, independent of the age range, should result from 
it. Estimation error could similarly be expected to produce displacements, small 
in the earlier age ranges but gradually increasing as age advances. The digit preference 
and the estimation error again, from their very nature, would be of cyclical 
nature over the array of end-digits. The age bias, arising as it does from diverse 
influences, was more likely to have a few focal points at the crucial ages (specific 
for different countries in a given period of time), apart from some general tendency 
to understate or overstate at particular age regions, without any cyclical characteristics : 
the pull of age bias is apt to be lopsided and to have a long arm. 


2.5. If ages are exactly known, stated and recorded the true distribution 
will of course be reproduced. When ages are exactly known but not correctly stated, 
age bias will obviously be the element responsible. If ages are known within a 
narrow margin, the digit preference may conceivably be an important element; but 
when ages are not known, or only known to lie within widely separated limits, the 
estimation error is likely to be the dominant element. It is natural that more than 
one element will be found superimposed on the dominant element in a practical 
situation; for example, when the ages are only known to lie within widely separated 
limits, the limiting ages themselves will be liable to the influence of age bias. 


2.6. What usually happens in a country like India is that the age is unknown 
and has to be estimated from looks or from comparison with known events 
or relative seniority ranking within the household or community. Such assessment 
of age has to be done by the field investigator or enumerator : in reality, an age band 
with its length depending on the type of evidence available, is consciously or unconsciously 
estimated by the investigator in the first instance and before the allocation 
to an individual age within the band. Behind each of the recorded individual 
ages (falling in the category not definitely known) is, therefore, an estimation age 
band. 


2.7. In WBSD Study²·¹, among other things, information about the type of 
evidence available on age was collected, along with the age as stated by the informant, 
the rating of the statement and the age assessed by the investigator in the field. 
Information about the type of available evidence is set in Table (2.1). 

2.8. It will be seen from Table (2.1) that in West Bengal, where the accuracy 
of age assessment might be the best for India²·², year of birth of only about 15 per 
cent of the total population was definitely known. The ages could be estimated 
or known approximately in about 40 per cent cases; and for the balance of about 
45 per cent the age recorded was more or less guess-work estimates. Even in the 
definite class, documentary evidence of age was obtained in a negligible proportion 
of cases, particularly in the city area. This position should be borne in mind while 
considering the age returns in the Indian situation. 


²·¹ ... sampling fractions were adjusted in a manner as to give uniform multipliers 
for the three agglomeration sectors. 43 investigators were employed in the experiment 
and about 1850 investigation-inspection days used up during April-June in the Study. 

²·² As measured by the Index of Concentration evolved by the U.S. Bureau of the Census 
and adopted by the Indian Census; Census of India 1951, Paper No. 3, 1954, p. 4. 

TABLE (2.1): DISTRIBUTION OF INDIVIDUALS BY TYPE OF AVAILABLE EVIDENCE 
ABOUT AGE 
(NSS WBSD Study 1954) 


                               type of evidence 

age-assessed     hearsay,    related with   definite     birth          total 
group            guess or    definite or    statement    certificate 
                 eye-        approximate    of year      or other 
                 estimate    ages or        of birth     documentary 
                             events 
(1)                 (2)         (3)           (4)          (5)           (6) 

city (170 households) 
1. 0-6               26          49            21           -             96 
   (%)            (27.1)      (51.0)        (21.9)                     (100.0) 
2. 7-16              40          40            16           -             96 
   (%)            (41.7)      (41.7)        (16.6)                     (100.0) 
3. 17-above         262          79            54            3           398 
   (%)            (65.8)      (19.8)        (13.6)        (0.8)        (100.0) 
4. all ages         328         168            91            3           590 
   (%)            (55.6)      (28.5)        (15.4)        (0.5)        (100.0) 

other urban (405 households) 
1. 0-6               81         138           105            5           329 
   (%)            (24.6)      (42.0)        (31.9)        (1.5)        (100.0) 
2. 7-16             137         199            60            8           404 
   (%)            (33.9)      (49.3)        (14.8)        (2.0)        (100.0) 
3. 17-above         581         423            60           20          1084 
   (%)            (53.6)      (39.0)         (5.5)        (1.9)        (100.0) 
4. all ages         799         760           225           33          1817 
   (%)            (44.0)      (41.8)        (12.4)        (1.8)        (100.0) 

rural (754 households) 
1. 0-6              129         232           302           20           683 
   (%)            (18.9)      (34.0)        (44.2)        (2.9)        (100.0) 
2. 7-16             322         399           132           24           877 
   (%)            (36.7)      (45.5)        (15.1)        (2.7)        (100.0) 
3. 17-above         958         920           137           57          2072 
   (%)            (46.2)      (44.4)         (6.6)        (2.8)        (100.0) 
4. all ages        1409        1551           571          101          3632 
   (%)            (38.8)      (42.7)        (15.7)        (2.8)        (100.0) 
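
The percentage rows of Table (2.1) follow directly from the counts, and the broad proportions quoted in paragraph 2.8 can be approximated by pooling the three sectors. The sketch below is illustrative only. It uses the "all ages" counts of the table, taking the city guess-work count as 328, the value implied by the column sum 26 + 40 + 262. Pooling the sectors by simple addition assumes that the sample is self-weighting across sectors, which is an assumption made here and not a statement of the report. The names `all_ages`, `row_percentages` and `pooled` are hypothetical.

```python
# Columns: guess or eye-estimate, related with known ages or events,
# definite statement of year of birth, documentary evidence.
all_ages = {
    "city": (328, 168, 91, 3),
    "other urban": (799, 760, 225, 33),
    "rural": (1409, 1551, 571, 101),
}


def row_percentages(counts):
    """Percentage distribution of one row of counts, to one decimal."""
    total = sum(counts)
    return [round(100.0 * c / total, 1) for c in counts]


sector_pct = {sector: row_percentages(c) for sector, c in all_ages.items()}

# Pooled counts over the three sectors (assumes a self-weighting sample).
pooled = tuple(sum(column) for column in zip(*all_ages.values()))
pooled_pct = row_percentages(pooled)

for sector, pct in sector_pct.items():
    print(sector, pct)
print("pooled", pooled, pooled_pct)
```

On the pooled counts, guess-work accounts for a little over 40 per cent of persons and a definite statement or document for about one in six, which is of the same order as the proportions mentioned in the text.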


2.9. Chart (1) gives the histogram representing the distribution of NSS 
4th round all-India urban sample population in individual ages, to demonstrate 
the extent of age distortions involved. The seriousness of the situation will be 
apparent when it is realised that in the NSS material used in the histogram 
the population returned at the single individual age 60 is 53.4 per cent of the total 
population returned in the age group 56-60 and 64.5 per cent of the total returned 
in age group 60-64; the respective proportions for the rural sector are 61.1 per cent 
and 68.6 per cent. And the quality of age reporting in NSS medium, as would be 
anticipated, was somewhat superior to the general census standard. These facts on 
the type of evidence demonstrate how limited is the possibility of improving the 
quality of the Indian age returns, perhaps for at least a generation to come. 

2.10. Discussions will be confined in this note to uniform quinary groups, 
the smallest groups usually adopted in representation of age statistics. Analysis for 
a most efficient set of decennial groups will involve similar considerations, though 
extension of the group interval to cover the full cycle of digits makes the problem 
somewhat easier here. Systems of unequal groups (alternate ternary-septanary 
for example) are unusual and inconvenient for presentation purposes. In what 
follows accordingly by the set of most efficient grouping will be meant the set of uniform 
quinary groups which gives the best fit to the true conditions over the whole 
range of life. Only the chronological age is dealt with in this note. 
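
A uniform quinary set is fixed once the end-digit at which the groups begin is chosen. The sketch below shows one way of assigning an age to its group under such a set. It reads the symbol 2 : 7 as "groups beginning at ages ending in 2 and 7", that is 2-6, 7-11, 12-16 and so on; this reading is an interpretation adopted for illustration (it agrees with the class limits 0-6, 7-16, 17-above used in Table (2.1)). The function name `quinary_group` and its parameter `start_digit` are hypothetical.

```python
def quinary_group(age: int, start_digit: int = 2) -> tuple[int, int]:
    """Return (lower, upper) limits of the five-year group containing `age`.

    `start_digit` is the end-digit at which groups begin: 0 gives the
    conventional 0-4 : 5-9 set, 2 gives the 2-6 : 7-11 set, 3 gives
    3-7 : 8-12. Ages below the first full group fall in a short opening
    group running from 0 to start_digit - 1.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if not 0 <= start_digit <= 4:
        raise ValueError("start_digit must lie between 0 and 4")
    if age < start_digit:
        return (0, start_digit - 1)
    lower = start_digit + 5 * ((age - start_digit) // 5)
    return (lower, lower + 4)


# The heavily heaped age 60 under the conventional and the 2 : 7 sets.
print(quinary_group(60, start_digit=0))  # (60, 64)
print(quinary_group(60, start_digit=2))  # (57, 61)
```

Under the conventional set the heaped age 60 sits at the lower edge of its group, whereas under the 2 : 7 reading it lies inside the group 57-61, so that persons displaced to 60 from neighbouring ages on either side are more likely to remain in their true group.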

2.11. As will be apparent from subsequent discussions a certain set of group- 
ing may not be most efficient both in rural and urban conditions, and for different 
income slabs within the population. In the same manner, it is conceivable that 
one single set may not be equally efficient, say, for different economic activity seg- 
ments like earners and dependants, or for different universe of events like births and 
deaths. As treatment of different sectors of the population in varying sets of grouping 
will disturb comparability, compromise is clearly called for. 


SECTION THREE 
DIGIT PREFERENCE 


3.1. The nature and extent of the digit preference was studied in isolation 
in the first instance. In the E & ESP Study 1954 conducted in the Indian Statistical 
Institute (ISI), a cluster of five 7-digited numbers without the 3 middle digits was 
given to the workers for supplying the suppressed middle digits by guess. The distribution 
of the central digit so supplied by 220 workers, expected to be free from any 
extraneous bias apart from the digit preference as such³·¹, is given in Table (3.1). 


TABLE (3.1) : FREQUENCY DISTRIBUTION OF THE CENTRAL MISSING DIGIT SUPPLIED 
BY GUESS 
(ISI, E and ESP Study 1954) 


digit           0      1      2      3      4      5      6      7      8      9    total 

frequency      86     59    133    117    115    134    127    140     94     95     1100 
(%)         (7.8)  (5.4) (12.1) (10.6) (10.4) (12.2) (11.6) (12.7)  (8.6)  (8.6)  (100.0) 


With expected frequency 110 in each cell, χ² = 54.8, significant at 1%. 
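
The χ² value quoted under Table (3.1) can be reproduced from the ten frequencies. The sketch below is a plain arithmetic check against a uniform expectation; the variable names are illustrative, and the critical value is the standard tabulated 1 per cent point of χ² with 9 degrees of freedom (ten cells with a fixed total), supplied here as an assumption about how the significance statement was reached.

```python
# Frequencies of the central missing digit 0, 1, ..., 9 from Table (3.1).
observed = [86, 59, 133, 117, 115, 134, 127, 140, 94, 95]

total = sum(observed)                # 220 workers x 5 numbers
expected = total / len(observed)     # uniform expectation per digit

chi_square = sum((o - expected) ** 2 / expected for o in observed)
percentages = [round(100.0 * o / total, 1) for o in observed]

# Tabulated 1 per cent point of chi-square with 9 degrees of freedom.
CRITICAL_1PCT_DF9 = 21.666

print(f"total = {total}, expected per cell = {expected:.0f}")
print(f"chi-square = {chi_square:.1f}")
print("significant at 1%:", chi_square > CRITICAL_1PCT_DF9)
```

The computed value is 54.8, as stated in the table note, and it is well beyond the 1 per cent point.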

3.2. It was apparent that true digit preference comprised a tendency to 
keep to the middle of the digit array 0, 1, ..., 9, within the range 2-7. The short-fall 
of the actual frequency from the expected (110) was quite marked at the end-digits 
0, 1, 8, 9. This pattern of digit preference is altogether different from the 
‘integer bias’ with strong pulls for 0 and 5 and smaller pulls for 2 and 8 that emerges 
from the analysis of age returns. As will be seen later, the heaping up at 0, 5, 2 
and 8 arises mainly from estimation error, which dominates 
the distortion in age reporting. 

3.3. Other interesting facets of digit preference were disclosed by the 
E & ESP Study. One was the disinclination to repeat digits in one sequence and 
the other the preference for a run of consecutively rising digits; we shall call this 
the second order of digit preference. Table (3.2) gives the distribution of two 
consecutive filled-in digits in the E & ESP study (the missing third and fourth places 
of 7-digited numbers). 


³·¹ It is relevant to mention that the effect of extra-sensory perception (ESP) was found negligible. 
Incidentally, an understanding of digit preference may be usefully employed to check and control 
(numerical) copying and computational mistakes. 


TABLE (3.2): FREQUENCY DISTRIBUTION OF DIGITS SUPPLIED BY GUESS IN THE FIRST 
TWO CONSECUTIVE MISSING DIGIT PLACES 
(ISI, E and ESP Study 1954) 


first                        second missing digit 
missing 
digit       0     1     2     3     4     5     6     7     8     9    total 

  0         9     7     9     4     6     5     1     2     3     3       49 
  1        14     9    29    13    10    10     7     7     6     3      108 
  2        12     8     6    22    16    13    18    12     9     7      123 
  3         6     4    14     6    33    20    15    18    11    14      141 
  4        11     4    19    24     5    38    32    13     7     8      161 
  5         9     5    12    14    13     9    24    17    14     6      123 
  6         5     5    16    14     7    10     7    40    11     7      122 
  7         5     5     7     5     7    11    13     7    21    13       94 
  8         5     3    11     6     9    10     4    16     3    21       88 
  9        10     9    10     9     9     8     6     8     9    13       91 

total      86    59   133   117   115   134   127   140    94    95     1100 


3.4. Table (3.3) shows separately the frequencies of selected digit pairs 
of Table (3.2). The expected frequency in each cell of the table on basis of 
random selection of digits is 11. Some of the lowest frequencies occurred with 
the repeated digit pairs 00, 11, 22, ..., 99, the total frequency of this set of ten 
repeated digit numbers being only 74 against expected 110. On the other hand, most 
of the highest frequencies occurred with the set of eight consecutively rising digit 
pairs 12, 23, ..., 89, the total frequency of the set being 228 against expected 88. 


TABLE (3.3) : FREQUENCY DISTRIBUTION OF SELECTED PAIRED CONSECUTIVE 
DIGITS SUPPLIED BY GUESS 
(ISI, E and ESP Study 1954) 


consecutive rising run of digits        repeated digits 
selected paired      frequency          selected paired      frequency 
digits                                  digits 
(1)                    (2)              (3)                    (4) 

                                        00                       9 
12                      29              11                       9 
23                      22              22                       6 
34                      33              33                       6 
45                      38              44                       5 
56                      24              55                       9 
67                      40              66                       7 
78                      21              77                       7 
89                      21              88                       3 
                                        99                      13 

total                  228              total                   74 
average               28.5              average                7.4 

With expected frequency 11 in each cell [remainder of the note, including a χ² value of 19.3, 
illegible in source]. 


3.5. Similar preferences for run of consecutively rising digits and dislike 
for the repeated digits were found in the analysis of other filled-in places of the numbers. 
Part of the short-fall in frequencies of the repeated digit pairs clearly resulted 
from the attraction for the consecutively rising digit pairs, contiguous to them. The 
preference is also disclosed in the frequency distribution of all the three filled-in missing 
digits given in Table (3.4). Arranged in the table in the order in which they occurred 
in the filled-in E & ESP schedules, the progressive diagonal shift of the maximum 
frequency range down the table is apparent. 
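
The totals quoted for the repeated and the consecutively rising digit pairs can be recovered from the two-way distribution of Table (3.2). The sketch below is an illustrative check. The matrix holds the cell frequencies of that table with rows for the first missing digit and columns for the second; a few cells that are illegible in the available copy have been filled in so as to agree with the printed row and column totals, which is an assumption made here. The variable names are hypothetical.

```python
# Rows: first missing digit 0-9; columns: second missing digit 0-9.
table_3_2 = [
    [9, 7, 9, 4, 6, 5, 1, 2, 3, 3],
    [14, 9, 29, 13, 10, 10, 7, 7, 6, 3],
    [12, 8, 6, 22, 16, 13, 18, 12, 9, 7],
    [6, 4, 14, 6, 33, 20, 15, 18, 11, 14],
    [11, 4, 19, 24, 5, 38, 32, 13, 7, 8],
    [9, 5, 12, 14, 13, 9, 24, 17, 14, 6],
    [5, 5, 16, 14, 7, 10, 7, 40, 11, 7],
    [5, 5, 7, 5, 7, 11, 13, 7, 21, 13],
    [5, 3, 11, 6, 9, 10, 4, 16, 3, 21],
    [10, 9, 10, 9, 9, 8, 6, 8, 9, 13],
]

grand_total = sum(sum(row) for row in table_3_2)
expected_per_cell = grand_total / 100        # random choice of both digits

# Repeated pairs 00, 11, ..., 99 lie on the main diagonal.
repeated = [table_3_2[d][d] for d in range(10)]

# Consecutively rising pairs 12, 23, ..., 89 lie just above the diagonal.
rising = [table_3_2[d][d + 1] for d in range(1, 9)]

print("repeated pairs:", sum(repeated), "against", expected_per_cell * len(repeated))
print("rising pairs:  ", sum(rising), "against", expected_per_cell * len(rising))
```

The diagonal cells add to 74 against an expectation of 110, and the eight rising pairs to 228 against 88, as stated in paragraph 3.4.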


TABLE (3.4): FREQUENCY DISTRIBUTION OF ALL THE THREE 
CONSECUTIVE MISSING DIGITS SUPPLIED BY GUESS 
(ISI, E and ESP Study 1954) 


digit      third     fourth     fifth 
           place     place      place 
(1)         (2)       (3)        (4) 

 0           49        86         85 
 1          108        59         90 
 2          123       133         88 
 3          141       117        139 
 4          161       115         95 
 5          123       134        143 
 6          122       127        118 
 7           94       140        102 
 8           88        94        116 
 9           91        95        124 

3.6. These inherent likes and dislikes in the run of numbers will naturally 
enter the reporting by the informant and the assessment and recording by the investigator. 
The reported individual age distributions of Bengal (males) of Census 1911 
and of West Bengal and U.P. (males) of Census 1951 were examined to see whether 
this second order of digit preference persisted in the demographic field. Although the 
disturbance in age returns from estimation error was much stronger, the digit 
preference of the second order, being non-cyclic, was not masked by the 
stronger cyclic disturbance. The numbers returned at particular ages in question 
could not be compared with the graduated frequency for the purpose of this 
examination and a method had to be devised to estimate the expected frequency 
on elimination of the second order of digit preference. 


3.7. Chart (2) shows for Bengal (males) 1911 Census population, in the 
form of smooth curves, the distribution of the numbers returned in individual ages. 


CHART (2) : Frequency curves of numbers returned at each end-digit in decennial age groups 
(Census 1911, Bengal males). [Horizontal axis: end digit; vertical axis: frequency.] 


3.8. It was argued that the form of the frequency curve in the region of 
an individual age in question, subject to the primary digit preference and estimation 
error (both of which were of cyclical nature within the digit array) but not subject 
to the second order of digit preference, will be intermediate to the form of the curve 
in similar end-digit regions either side, ten years of age up and below. In other 
words, denoting the actual number returned at individual age $x$ by the symbol $n_x$, 
the form of the individual age frequency curve in the $n_{x-1} : n_x : n_{x+1}$ region 
would be intermediate between those rendered by the $n_{x-11} : n_{x-10} : n_{x-9}$ 
and $n_{x+9} : n_{x+10} : n_{x+11}$ regions, if the second order of 
digit preference were not there. In effect the assumption allows only for cyclic 
distortions and thus eliminates the non-cyclic influences. The distortion from age 
bias is therefore also not allowed for; but for simplicity the age bias which is 
localised can be left out of account for the time being. 


3.9. The simplest estimations were made in pursuance of the above assumption 
and constants determined from two linear simultaneous equations on either side 
were applied to the repeated digit age in question to estimate the appropriate expect- 
ed frequency. Thus, to estimate E(n,) the expected frequency at the repeated digit 
age r = 100 + v the formula E(n,) = An, 44- Bin, 44). Was tried, the constants 


A,, B, being determined from the two equations 7 9— Aq, aoa Вто РТ ZU 


and тодо А4101 + Bits ao 0+1): Table (2.4) gives the numbers actually 


returned and the numbers expected at the repeated digit ages. for Census 1911 


Bengal (males), and Census 1951 West Bengal (males) and Uttar Pradesh (males). 
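The estimation in para 3.9 amounts to solving two linear equations in two unknowns and applying the solution to the neighbours of the repeated digit age. A minimal sketch follows; the function names and the numbers in the usage note are hypothetical illustrations, not census figures.

```python
def fit_neighbour_constants(lower, upper):
    """Solve for (A, B) from the two equations
        n[x-10] = A * n[x-11] + B * n[x-9]
        n[x+10] = A * n[x+9]  + B * n[x+11]
    Each argument is a (left, centre, right) triple of returned numbers."""
    l1, c1, r1 = lower
    l2, c2, r2 = upper
    det = l1 * r2 - r1 * l2
    if det == 0:
        raise ValueError("the two equations are not independent")
    # Cramer's rule for the 2 x 2 system.
    a = (c1 * r2 - r1 * c2) / det
    b = (l1 * c2 - c1 * l2) / det
    return a, b


def expected_frequency(lower, upper, left, right):
    """E(n_x) = A * n[x-1] + B * n[x+1], with A and B fitted ten years
    below and ten years above the repeated digit age x."""
    a, b = fit_neighbour_constants(lower, upper)
    return a * left + b * right
```

With the hypothetical triples (200, 280, 300) ten years below and (100, 170, 200) ten years above, the fitted constants are A = 0.5 and B = 0.6, so neighbours of 150 and 250 give an expected frequency of 225.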
TABLE (3.5): POPULATION RETURNED AT REPEATED DIGIT INDIVIDUAL AGES IN CENSUS AND EXPECTED POPULATION ON ELIMINATION OF SECOND ORDER OF DIGIT PREFERENCE
(Census of India)

        age    number returned    expected           percentage deviation
         x     in census (000)    frequency (000)    (E(n_x) - n_x) / n_x x 100
               n_x                E(n_x)
        (1)         (2)                (3)                 (4)

Bengal (males) : Census 1911
 1.     11         1310               1606                22.6
 2.     22         2156               2304                 6.9
 3.     33          374                363                -2.9
 4.     44          202                218                 7.9
 5.     55         1017               1096                 7.8
 6.     66           39                 41                 5.1

West Bengal (males) : Census 1951
 1.     11         2395               2799                16.9
 2.     22         2547               2474                -2.9
 3.     33         1397               1491                 6.7
 4.     44         1250               1292                 3.4
 5.     55         1068               1119                 4.8
 6.     66          212                232                 9.4

U. P. (males) : Census 1951
 1.     11            …                  …                   …
 2.     22            …                  …                36.3
 3.     33            …                  …                -3.8
 4.     44         1871               1918                 2.5
 5.     55         4991               5332                 6.8
 6.     66          374                406                 8.6


3.10. The expected frequency was as a rule higher than the numbers returned: the expected frequency was always higher for the repeated digit pairs 44 and above, though small deviations in the reverse direction appeared in the earlier repeated digit pair ages. As stated earlier, the age bias could not be allowed for in the method of estimation used, and the element of age bias perhaps disturbed the expected frequency of the earlier repeated digit pair ages rendered by the method; a […] based formula might have given better balanced estimates. But the evidence in support of the hypothesis that the second order digit preference persists in age reporting was clear enough from the analysis done above.


3.11. It seems probable that the inflation noticed at age 60 in the age returns of many countries may be, at least partly, due to the dislike of the repeated digit number […].

SECTION FOUR

ESTIMATION ERROR


4.1. Some results obtained in the E & ESP Study relevant to the estima- 
tion error are discussed first. The E & ESP Study schedule also contained a cluster 
of five lines, of which the lengths were required to be eye-estimated and recorded 
to the second place of decimal in terms of an unconventional unit of length ‘L’ shown 
on the body of the schedule[4.1]. The conditions were such that the second decimal
figure could be nothing better than pure guess. The distribution of the second decimal 
figure supplied by 222 workers is given in Table (4.1). 


TABLE (4.1): DISTRIBUTION OF (1) THE SECOND PLACE AFTER DECIMAL OF THE EYE-ESTIMATED LENGTH OF LINES AND (2) THE END-DIGIT OF AGE OF ALL-INDIA RURAL SAMPLE POPULATION AGED 40-ABOVE

(ISI, E and ESP Study 1954 and NSS 4th round[4.2] 1952)

                                                   digit
item                       0       1       2       3       4       5       6       7       8       9     total

1. second decimal
   place of eye-
   estimate              414      25      72      45      38     344      58      41      48      25      1110
   (%)                 (37.3)   (2.3)   (6.5)   (4.0)   (3.4)  (31.0)   (5.2)   (3.7)   (4.3)   (2.3)   (100.0)

2. end digit, age
   40-above             2326     276     552     279     297    1265     301     234     352     165      6047
   (concentration)[4.3] (31.3)  (4.3)   (8.8)   (4.7)   (5.2)  (22.7)   (6.2)   (5.1)   (7.8)   (3.9)   (100.0)


4.2. The concentration at ‘preferred’ digits is now of the familiar pattern 
found in age returns, though more accentuated here for the digits ‘5’ and ‘0’. The 
striking similarity between the frequency distribution of the digit in the second 
decimal place in the estimated lengths of lines in the E & ESP Study and the end 
digit of age of the all-India rural sample population aged 40 and above, suggested 
that most of the error in age returns is that of estimation when the unit digit of age was just a matter of guess.


4.3. On further analysis, a tendency to over-estimate was also disclosed by the E & ESP Study, which was suggestive. The actual aggregate length of the five lines correct to the first place of decimals was 6.3L; but the mean of the estimates recorded by the workers was 6.7L. The distribution of the recorded estimates is given in Table (4.2). The over-estimation was highly significant; as against only 27% who came on the side of under to correct estimation, 73% over-estimated the aggregate length. The major peak of the distribution of estimates was at 6.6L.


[4.1] Appendix 0.
[4.2] All the NSS 4th round rural tables of this note cover the six 1/16th part samples 3, 4, 7, 8, 15 & 16 split at the village level for operational convenience.
[4.3] The measure of concentration is a percentage distribution of the end digits defined in para 6.5.

TABLE (4.2): DISTRIBUTION OF EYE-ESTIMATE OF THE AGGREGATE LENGTHS OF A CLUSTER OF 5 LINES (ACTUAL AGGREGATE 6.33L) ROUNDED TO THE FIRST DECIMAL PLACE

(ISI, E and ESP Study 1954)

        length (L)   frequency              length (L)   frequency
           (1)          (2)                    (1)          (2)
  1.       5.1            1           11.      6.4           16
  2.       5.4            1           12.      6.5           20
  3.       5.7            1           13.      6.6           29
  4.       5.8            1           14.      6.7           11
  5.       6.0           11           15.      6.8           15
  6.       6.1            6           16.      6.9           14
  7.       6.2           15           17.      7.0           16
  8.       6.3           24           18.      7.1            1
                                      19.      7.2            7
                                      20.      7.3            3
                                      21.      7.4            6
                                      22. onward              […]

  9. sub-total: 6.3 below     60
     (%)                   (27.0)
 10. sub-total: 6.4 above    162
     (%)                   (73.0)

mean = 6.69    σ² = .2627    s_m = .0434


4.4. Such tendency of over-estimation (or under-estimation) has been found by other operators in different experiments[4.4].

4.5. It was therefore decided to investigate how far the bias to over-estimate or under-estimate might have entered the assessment of age. Actual age data of the WBSD Study were analysed to investigate this. Table (4.3) gives the distribution of the individuals covered by the Study in age-assessed minus age-stated groups, separately for the city, other urban and rural sectors[4.5].


TABLE (4.3): DISTRIBUTION OF INDIVIDUALS IN AGE-ASSESSED MINUS AGE-STATED CLASSES, UNDER EDUCATION STANDARD BREAKDOWNS

(NSS, WBSD Study 1954)

                        city (170 households)              other urban (405 households)        rural (754 households)
age-assessed minus    total  illit-  liter-  matri-      total  illit-  liter-  matri-      total  illit-  liter-  matri-
age-stated                   erate   ate     culate             erate   ate     culate             erate   ate     culate
       (1)             (2)    (3)     (4)     (5)         (6)    (7)     (8)     (9)         (10)   (11)    (12)    (13)

1. age-assessed
   = age-stated        495    194     236     65         1552    749     710     93            …      …       …      20
   (%)               (83.9) (86.2)  (81.9)  (84.4)      (86.2) (81.6)  (90.5)  (94.9)       (77.8) (76.3)  (81.1)  (95.2)

2. age-assessed
   < age-stated         20      8      10      2           70     59      11      -            …      …       …       -
   (%)                (3.4)  (3.6)   (3.5)   (2.6)       (3.9)  (6.4)   (1.4)     -          (7.8)  (8.3)   (6.5)     -

3. age-assessed
   > age-stated         75     23      42     10          178    110      63      5            …      …       …       1
   (%)               (12.7) (10.2)  (14.6)  (13.0)       (9.9) (12.0)   (8.1)   (5.1)         (…)  (15.4)    (…)    (4.8)

4. total               590    225     288     77         1800    918     784     98          3522   2637     864     21
   (%)                (100)  (100)   (100)   (100)       (100)  (100)   (100)   (100)        (100)  (100)   (100)   (100)


[4.4] Examples of such bias of over-estimation in selection of ‘representative’ units have been cited by Frank Yates in “Sampling Methods for Censuses and Surveys” (1953) at pp. 12-13.
[4.5] Age was not stated in less than 1% of the cases only (most of it in the rural sector) and the not-stated cases have been left out of the tables.

4.6. The age-assessed minus age-stated was positive two to four times more often than it was negative: that is, a higher age was assessed two to four times more frequently. With regard to this feature the male-female differential was not significant, but the relative proportion of higher age-assessed tended to go up among non-Hindus and among Hindi-speaking population in the West Bengal field[4.6]. The proportion was, if anything, rather higher in city area and among the educated, as Table (4.3) shows.

4.7. The investigator's assessment of age is usually the only thing available and recorded in census and surveys in India, but in the WBSD Study the investigators were definitely instructed not to render any such assistance in age statement, and clear evidence is thus furnished that the investigator assesses a higher age in the sum than the informant states. The question is whether the investigator was trying to correct in his assessment a real under-statement of age by the informant, or whether, there being in the aggregate no such under-statement, the age was over-estimated by the investigator in the process of assessment. An attempt was made to answer this question from further examination of the WBSD Study material. Table (4.4) gives the distribution of individuals in age-assessed minus age-stated classes for different categories of rating of statement.

TABLE (4.4): DISTRIBUTION OF INDIVIDUALS IN AGE-ASSESSED MINUS AGE-STATED CLASSES UNDER RATING OF STATEMENT CATEGORIES
(NSS, WBSD Study 1954)

                                          rating of age statement
                        city (170 households)        other urban (405 households)     rural (754 households)
age-assessed minus     guess   approx-  definite     guess   approx-  definite       guess   approx-  definite
age-stated                     imate                         imate                           imate
       (1)              (2)     (3)      (4)          (5)     (6)      (7)            (8)     (9)      (10)

1. age-assessed
   = age-stated         180     153      162          350     513      689            733     992      1008
   (%)                (79.3)  (81.8)   (92.0)       (74.2)  (82.7)   (97.3)         (63.3)  (75.4)   (96.3)

2. age-assessed
   < age-stated          11       5        4           38      28        4            142     120        12
   (%)                 (4.8)   (2.7)    (2.3)        (8.0)   (4.5)    (0.6)         (12.2)   (9.1)    (1.1)

3. age-assessed
   > age-stated          36      29       10           84      79       15            284     204        27
   (%)                (15.8)  (15.5)    (5.7)       (17.8)  (12.8)    (2.1)         (24.5)  (15.5)    (2.6)

4. total                227     187      176          473     620      708           1159    1316      1047
   (%)               (100.0) (100.0)  (100.0)      (100.0) (100.0)  (100.0)        (100.0) (100.0)   (100.0)


4.8. For all categories of rating, the proportion of ages assessed higher to 
ages assessed lower is fairly stable, about 2 for all sectors combined, rather a little 
higher for the category of rating definite. A progressive fall in the proportion was 
to be expected in passing from guess to definite category of rating of age statement 
if the investigator, in his assessment, was correcting under-statements for which the 
guess and approximate categories offered a much bigger scope. The actual pattern 
brought out suggests over-estimation in age-assessed.
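The ratio discussed in para 4.8 can be checked directly from the counts of Table (4.4). The sketch below pools the three sectors for each rating category. The counts are as read from the table, with two assumptions: the city "assessed lower" entries are taken as 11, 5 and 4, and the rural "approximate" entry as 120, these being the values consistent with the printed percentages and column totals.

```python
# (assessed lower, assessed higher) counts by rating category and sector,
# taken from Table (4.4); see the assumptions stated above.
TABLE_4_4 = {
    "guess":       {"city": (11, 36), "other urban": (38, 84), "rural": (142, 284)},
    "approximate": {"city": (5, 29),  "other urban": (28, 79), "rural": (120, 204)},
    "definite":    {"city": (4, 10),  "other urban": (4, 15),  "rural": (12, 27)},
}


def higher_to_lower_ratio(by_sector):
    """Ratio of ages assessed higher to ages assessed lower, sectors pooled."""
    lower = sum(low for low, _ in by_sector.values())
    higher = sum(high for _, high in by_sector.values())
    return higher / lower


ratios = {rating: round(higher_to_lower_ratio(sectors), 2)
          for rating, sectors in TABLE_4_4.items()}
print(ratios)
```

On these counts the pooled ratio comes to about 2.1 for guess, 2.0 for approximate and 2.6 for definite, which agrees with the statement that the proportion is fairly stable at about 2 and a little higher for the definite category.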

4.9. The assumption is, however, implicit here that the rating of statement has been proper. As to the reliability of the rating, Table (4.5) gives the frequency of end-digit ‘0’ in the age-stated range 23-62; a systematic fall in the concentration at end-digit ‘0’ with upgrading in rating category is observed. It is permissible to take this as a very rough test of the validity of rating done.

[4.6] Appendix 1.


TABLE (4.5): CONCENTRATION AT END-DIGIT ‘0’ IN AGE STATEMENTS UNDER DIFFERENT RATING OF STATEMENT CATEGORIES

(NSS, WBSD Study 1954)

                    city (170 households)               other urban (405 households)        rural (754 households)
                  population aged 23-62  concen-      population aged 23-62  concen-      population aged 23-62  concen-
rating of         returned               tration      returned               tration      returned               tration
statement         at end-digit   total   col(2)/      at end-digit   total   col(5)/      at end-digit   total   col(8)/
                  ‘0’ ages               col(3)x100   ‘0’ ages               col(6)x100   ‘0’ ages               col(9)x100
    (1)               (2)         (3)      (4)            (5)         (6)      (7)            (8)         (9)      (10)

1. guess               37         161     23.0             71         226     31.4            160         596     26.9
2. approximate         14          84     16.6             70         341     20.5            143         665     21.5
3. definite             5          51      9.8             36         197     18.3             39         219     17.8
4. total               56         296     18.9            177         764     23.2            342        1480     23.1
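The rough validity test of para 4.9 can be reproduced from the counts of Table (4.5): the concentration is the number returned at end-digit ‘0’ ages as a percentage of the population aged 23-62, and it should fall in passing from guess to definite. The sketch below is only an illustration of that arithmetic; the dictionary layout and function name are assumptions of this example.

```python
# (returned at end-digit '0' ages, total aged 23-62) from Table (4.5).
TABLE_4_5 = {
    "city":        {"guess": (37, 161),  "approximate": (14, 84),   "definite": (5, 51)},
    "other urban": {"guess": (71, 226),  "approximate": (70, 341),  "definite": (36, 197)},
    "rural":       {"guess": (160, 596), "approximate": (143, 665), "definite": (39, 219)},
}


def concentration(at_zero, total):
    """Percentage of the 23-62 population returned at end-digit '0' ages."""
    return 100.0 * at_zero / total


for sector, by_rating in TABLE_4_5.items():
    values = {rating: round(concentration(*counts), 1)
              for rating, counts in by_rating.items()}
    print(sector, values)
```

The recomputed values can differ from the printed ones in the last decimal place, presumably through rounding in the original, but the systematic fall from guess to definite holds in all three sectors.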


4.10. Table (4.6) gives the distribution of the gap between age-assessed and age-stated in broad age-assessed groups.

TABLE (4.6): DISTRIBUTION OF INDIVIDUALS IN DIFFERENT AGE RANGES UNDER AGE-ASSESSED MINUS AGE-STATED GROUPS
(NSS, WBSD Study 1954)

                                             age-assessed (years)
                        city (170 households)        other urban (405 households)     rural (754 households)
age-assessed minus     0-16    17-61    62-          0-16    17-61    62-            0-16    17-61    62-
age-stated                              above                         above                           above
       (1)              (2)     (3)      (4)          (5)     (6)      (7)            (8)     (9)      (10)

 1. -11 below            -       -        1            -       2        …              …       …        …
 2. -10 to -6            -       1        -            2       3        …              …       …        …
 3. -5 to -3             -       3        2            5      14        …              …      28        2
 4. -2 to -1             5       8        1           20      23        …              …       …        …
 5. -1 & below           5      11        4           25      42        …              …       …        …
    (%)                (2.6)   (2.9)   (16.0)        (3.4)   (4.2)    (4.8)           (…)     (…)      (…)
 6. 0                  182     296       17          684     815       53              …       …        …
    (%)               (94.8)  (79.4)   (68.0)       (93.3)  (81.1)   (85.5)           (…)     (…)      (…)
 7. 1 to 2               5      42        2           20      82        …              …       …        …
 8. 3 to 5               -      21        2            4      54        6              …       …        …
 9. 6 to 10              -       3        -            -       8        …              3      83        3
10. 11 above             -       …        …            -       4        …              -       7        2
11. 1 & above            5      66        4           24     148        6             88       …        …
    (%)                (2.6)  (17.7)   (16.0)        (3.3)  (14.7)    (9.7)          (5.7)    (…)    (26.6)
12. total              192     373       25          733    1005       62           1558       …        …
    (%)              (100.0) (100.0)  (100.0)      (100.0) (100.0)  (100.0)        (100.0) (100.0)  (100.0)

4.11. While a tendency to overstate age in the old age range is well known and is likely to have brought down the proportion of the total over-assessment to the total under-assessment in the old age range 62-above, the even break of over-assessment and under-assessment in the young age range 0-16 is not so easily explained: the margin available for under-statement is however limited in the young age range by the age attained. The spread of the gap between age-assessed and age-stated is interesting; less than half of the deviations exceed 2 years of age and most of them fall within the limits of ±5 years.
4.12. In the WBSD Study, as already indicated, information was collected about the type of evidence available to the investigator in assessment of ages, on his best efforts. Table (4.7) is an alternative presentation of the information obtained in this respect; both the rating of assessment and the type of evidence on which the assessment naturally rested were combined here to give composite categories for assessment-evidence types. Table (4.7) shows that definite evidence of age was available in 18-30% of cases only; and, as was seen from Table (2.1), a definite statement of age by the informant was behind most of it. Considering that 16-19% of the individuals covered were in the age range 0-6, the grave weakness in the field of the age assessment is quite apparent. The proportions with definite assessment-evidence type were higher than the respective proportions with definite evidence of age available given in Table (2.1), and the gap increased progressively in passing from the city to the rural sector. The age-assessed series could not be taken as a better approximation to the true ages, in the circumstances.


TABLE (4.7): DISTRIBUTION OF INDIVIDUALS IN ASSESSMENT-EVIDENCE TYPE CATEGORIES UNDER SEX BREAKDOWNS

(NSS, WBSD Study 1954)

                     city (170 households)         other urban (405 households)      rural (754 households)
assessment-         total    male    female        total    male    female          total    male    female
evidence type
     (1)             (2)     (3)      (4)           (5)     (6)      (7)             (8)     (9)      (10)

1. guess             276     154      122           372     204      168             883     434      449
   (%)             (46.8)  (45.2)   (49.0)        (20.5)  (21.0)   (19.8)          (24.3)  (23.2)   (25.5)
2. approximate       191     108       83          1025     521      504            1743     876      867
   (%)             (32.4)  (31.7)   (33.3)        (56.4)  (53.8)   (59.4)          (48.0)  (46.7)   (49.3)
3. definite          123      79       44           420     244      176            1006     564      442
   (%)             (20.8)  (23.1)   (17.7)        (23.1)  (25.2)   (20.8)          (27.7)  (30.1)   (25.9)
4. total             590     341      249          1817     969      848            3630       …        …
   (%)            (100.0) (100.0)  (100.0)       (100.0) (100.0)  (100.0)         (100.0) (100.0)  (100.0)

4.13. The values of concentration at end digit ‘0’ under different ratings of assessment, similar to those shown in Table (4.5) but for the age-assessed series now, are given in Table (4.8). Comparative study of the concentration makes clear that the quality of the age-assessed series is no better than the age-stated series.

TABLE (4.8): CONCENTRATION AT END-DIGIT ‘0’ IN AGE-ASSESSED SERIES UNDER DIFFERENT RATING OF ASSESSMENT CLASSES
(NSS, WBSD Study 1954)

                    city (170 households)               other urban (405 households)        rural (754 households)
                  population aged 23-62  concen-      population aged 23-62  concen-      population aged 23-62  concen-
rating of         returned               tration      returned               tration      returned               tration
assessment        at end-digit   total   col(2)/      at end-digit   total   col(5)/      at end-digit   total   col(8)/
                  ‘0’ ages               col(3)x100   ‘0’ ages               col(6)x100   ‘0’ ages               col(9)x100
    (1)               (2)         (3)      (4)            (5)         (6)      (7)            (8)         (9)      (10)

1. guess               39         150     26.0             49         207     23.7            128         442     29.0
2. approximate         14         101     13.9             97         437     22.2            204         889     23.0
3. definite             3          46      6.5             26         140     18.6             44         273     16.1
4. total               56         297     18.9            172         784     21.9            376        1604     23.4


4.14. The probable reason why ‘0’ and ‘5’ happen to be the most favoured, in that order, can be examined here. Rounding off at ‘0’ naturally gets first preference, as one digit, the unit place, is cut out by such approximation; and at that level of approximation digit ‘5’ gets the n[…] usually a slight preference for even digit over odd. If an estimate is above ‘0’ but not far away […] recorded under ‘0’ and not under it […]; [if i]t is far away from ‘0’ it will be recorded differently, and will then rather […]. The preference for digit ‘8’ has exactly a similar explanation.


4.15. “Census of India 1951—Age […]” gives a diagonal table for the Madras (male) population showing […] ‘under-statements’ and ‘over-statements’ at each age from 6 […]; [it] envisages notional transfers only to adjoining ages, and the graduated individual [age] frequencies taken as true are derived from the […] centred round the most preferred digits ‘0’ and ‘5’. The set of grou[…] take into account the tendency of over-estimation. In the age range 27-67 of this table, the average over-statement of age is 29 per cent as against the average under-statement of about 20 per cent. In the age range 6-67, the proportions of similar average over-statement and under-statement are 22 per cent and 19 per cent respectively. Supplementary evidence of the tendency of over-estimation in age return is thus disclosed by the Census reporting itself.


4.16. The analysis done above shows that the estimation error really falls into two parts, one arising from rounding-off approximation and the other from over-estimation. The element of over-estimation missed attention in the past and the estimation error was equated to the error of rounding off.

4.17. In the NSS experimental West Bengal Household Comparative (WBHC) Study 1955 the same sample households as of NSS 4th round were revisited after a lapse of about 3 years to measure changes in living conditions during the intervening period: this opportunity was utilised to investigate further the over-estimation bias in age reporting, and the ages of the household members were independently ascertained again. Comparisons between ages reported in NSS 4th round and WBHC Study showed some interesting features. Table (4.9) gives the distribution of the deviation between the age reported in the WBHC Study and the age expected on the basis of the three-year old NSS 4th round age return.


TABLE (4.9): FREQUENCY DISTRIBUTION OF THE NUMBER IN DIFFERENT AGE GROUPS BY ADJUSTED DIFFERENCE IN AGES

(NSS 4th round 1952 and WBHC Study 1955 : 650 rural households)

age difference :                         age (years) WBHCS
WBHCS minus
(4th round + 3)      3-9     10-19    20-29    30-39    40-above    all ages
                     (1)      (2)      (3)      (4)       (5)         (6)
5 4 116 
1. 158 128 8 б 
4) Ш) 800) 0818) (фт) 89 8279 
0 a - 
126 134 125 244 720 
9, Г 1 pe (22:6) (29.4) (33.2) (41.0) (48.4) (35.3) 
б 22. 4 : 
З. moan differenco 1.28 1.81 2.32 is 3.17 2.41 
141 96 144 i 
uS Lu GU at) 820 GL) _ GRO) o 
: | 5 —2.54 —2.79 ы 
5. mean difforonco 1.67 1.61 —1.98 = 2.09 
6 498 403 305 504 2.042 
EM . 0 awo 000) 0000 01000) (100.0) 
7 Е چ‎ 0.09 0.21 0.74 0.18 
* Mean difference —0.24 —0.01 EXE 
—9.99, Боан о. 
E. with mean difference 0.18, t=3 


4.18. The ages reported in the WBHC Study were in sum somewhat higher than those expected from the three-year old NSS 4th round returns, with an average over-statement of 0.18 years. But this was for the persons common in the two surveys: those born after the 4th round survey, and those dead since, were naturally left out of the comparison, apart from the migrants. The average reported ages of the two surveys did not differ significantly. It was interesting to observe that while there was a big comparative under-statement in the youngest present age range 3-9, the difference narrowed in the age range 10-19, then changed to slight over-statement in the next higher age range 20-29 and to increasing over-statements in the later age ranges 30-39 and 40-above. The over-statement observed for the total range was thus the contributory effect of high over-statements at the advanced ages.

[…] no reasons […]. The reporting population was […] were not subject to repeated surveys in the intervening period […] or otherwise condi[…]. The WBHC Study results thus […] under-estimation of […] ages in the young age ranges below 20 and distinct […] with a moderate over-[…] at the later adult ages. Thus in the aggregate the […] aggregate over-estimate may remain […] ages were over-estimated most in the previous survey, […] subsequent increa[…] in such proportion […] over-estimation of those […]

SECTION FIVE
AGE BIAS

5.1. The possible location and nature of the age bias in different population formations can often be known from their cultural traits. Tendencies to exaggerate ages in the threshold of adulthood and after retirement, conscious under-statement of age in the young-adult range by the females in some cultures, and conscious over-statement of age to attain legal majority, to escape military service and to qualify for old-age pensions or just to impress as outstandingly old, are some of the common sources of the bias experienced. The mores and the laws of the land work behind the bias, and changes in them deflect the pattern of the bias: the pattern is usually quite stable over time in each country until the relevant laws change. The age bias is of specific location and is thus distinguished from the general estimation bias, from which it also differs in nature.

5.2. Longitudinal comparisons across consecutive census intervals, allowing for migration if significant, and for the digit preference and the estimation error, might disclose the pattern of age bias: except for the digit preference of the second order, these preferences and errors of cyclical nature get automatically allowed for to a large extent in comparisons over a series of decennial censuses. If reliable birth-death registration and migration statistics be available, the total distortions could be determined by reconciliation of successive census results with the intervening movements and the extent of age bias broadly assessed. Such techniques are however not helpful when the relevant data are grossly defective, as in India. Sample verification of ages in the field with a superior set of investigators, who travel down to the birth certificate or other best available evidence of age, is another alternative method employed to locate age bias or rather the total distortions in age recording. It may be reiterated that the cardinal point of interest in the problem of age grouping is the total distortion, and the elements leading to it are studied to get a better understanding of the position.

5.3. It should have been possible to spot the age bias at analysis stage by examining the run of the ratios that the numbers returned at each end digit of age constitute of the total returned in the successive decennial age ranges, say. But such examination is also not very helpful for the Indian situation where the big distortions from estimation error mask most other features. Table (5.1) shows for the NSS medium the ratios

$$\frac{n_{10v+u}}{\sum_{u=0}^{9} n_{10v+u}} \times 100$$

up to age 79, for each end digit $u$ in the whole sequence of age ranges

$$\sum_{u=0}^{9} n_{10v+u}, \qquad v = 0, 1, 2, \ldots, 7.$$

TABLE (5.1): RATIO OF NUMBERS RETURNED AT EACH END-DIGIT TO TOTAL NUMBERS IN THE SUCCESSIVE DECENNIAL AGE RANGES

(NSS 4th round 1952, All-India rural sample: 28,918 persons)

                              decennial age range
end
digit      0-9    10-19    20-29    30-39    40-49    50-59    60-69    70-79
 (1)       (2)     (3)      (4)      (5)      (6)      (7)      (8)      (9)
  0       10.6    15.2     18.7     26.6     31.4     38.6     49.2     51.0
  1       10.1     8.1      6.1      4.6      4.5      5.0      4.6      3.8
  2       10.9    14.4     13.5     11.8      9.5      9.2      8.9      7.7
  3       10.8     8.7      6.9      5.8      5.0      4.5      3.8      4.4
  4        9.7     9.4      9.1      5.8      5.6      4.6      4.7      3.0
  5       10.9    10.4     18.4     20.4     23.1     19.6     18.7     19.7
  6       10.3    11.3      8.4      8.3      6.2      5.0      2.7      3.3
  7        8.9     5.8      5.4      4.5      4.9      3.8      2.1      2.7
  8       10.2    11.7      9.8      8.4      6.8      6.5      3.6      3.0
  9        7.6     5.0      3.7      3.8      3.0      3.2      1.7      1.4
total    100.0   100.0    100.0    100.0    100.0    100.0    100.0    100.0


5.4. The ratios for any one end digit of age could have been expected ordinarily to form a smooth progression over the successive decennial age ranges, on which the impact of the age bias (ignoring the comparatively small influence of the second order of the digit preferences) will have produced marked local disturbances. But in any case, the history of the particular population growth and the local mores and laws have to be turned to for confirmation: past fluctuations in birth-death experience may also produce isolated tracts of accumulation, particularly in small populations. Sharp rises in the ratios at ages 25 and 60 are observable in Table (5.1). The pull for age 60 was serious enough in the NSS medium to take the quinary age group 60-64 total beyond the 55-59 total. The examination of the run of the ratios as a method of spotting possible age bias is not satisfactory when the concentration at particular end digits is so high and the relative concentrations between the end digits change so violently as in Table (5.1).


5.5. The moot question in the present case was whether these sharp rises in the run of ratios came rather from mounting concentration at these preferred end digits. The age bias operated in a manner so as to accelerate and decelerate the flow of numbers returned in the immediate neighbourhoods of certain crucial ages: yet another approach of spotting the location of age bias suggested itself from this. Examination of the first differences of the ratios over a few consecutive end digits in neighbouring decennial age ranges should show up the acceleration and deceleration effects. Table (5.2) gives the first differences of the ratios in Table (5.1).

TABLE (5.2): FIRST DIFFERENCES OF THE RATIOS OF NUMBERS RETURNED AT EACH END-DIGIT AS SHOWN IN TABLE (5.1)

(NSS 4th round 1952, all-India rural sample)

             difference between successive decennial age ranges (later minus earlier)
end        0-9 to    10-19 to   20-29 to   30-39 to   40-49 to   50-59 to   60-69 to
digit      10-19     20-29      30-39      40-49      50-59      60-69      70-79
  0          4.6       3.5        7.9        4.8        7.2       10.6        1.8
  1         -2.0      -2.0       -1.5       -0.1        0.5       -0.4       -0.8
  2          3.5      -0.9       -1.7       -2.3       -0.3       -0.3       -1.2
  3         -2.1      -1.8       -1.1       -0.8       -0.5       -0.7        0.6
  4         -0.3      -0.3       -3.3       -0.2       -1.0        0.1       -1.7
  5         -0.5       8.0        2.0        2.7       -3.5       -0.9        1.0
  6          1.0      -2.9       -0.1       -2.1       -1.2       -2.3        0.6
  7         -3.1      -0.4       -0.9        0.4       -1.1       -1.7        0.6
  8          1.5      -1.9       -1.4       -1.6       -0.3       -2.9       -0.6
  9         -2.6      -1.3        0.1       -0.8        0.2       -1.5       -0.3
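The first differences of Table (5.2) follow mechanically from the ratios of Table (5.1), and a short sketch makes the computation explicit. The ratios below are as read from Table (5.1), on the assumption that the entry for digit 1 in range 60-69 is 4.6 and the entry for digit 7 in range 70-79 is 2.7, the values that make the respective columns total 100.0; the variable and function names are illustrative only.

```python
# Ratios (per cent) by end digit for the decennial ranges 0-9, 10-19, ..., 70-79,
# as read from Table (5.1) under the assumptions stated above.
RATIOS = {
    0: [10.6, 15.2, 18.7, 26.6, 31.4, 38.6, 49.2, 51.0],
    1: [10.1, 8.1, 6.1, 4.6, 4.5, 5.0, 4.6, 3.8],
    2: [10.9, 14.4, 13.5, 11.8, 9.5, 9.2, 8.9, 7.7],
    3: [10.8, 8.7, 6.9, 5.8, 5.0, 4.5, 3.8, 4.4],
    4: [9.7, 9.4, 9.1, 5.8, 5.6, 4.6, 4.7, 3.0],
    5: [10.9, 10.4, 18.4, 20.4, 23.1, 19.6, 18.7, 19.7],
    6: [10.3, 11.3, 8.4, 8.3, 6.2, 5.0, 2.7, 3.3],
    7: [8.9, 5.8, 5.4, 4.5, 4.9, 3.8, 2.1, 2.7],
    8: [10.2, 11.7, 9.8, 8.4, 6.8, 6.5, 3.6, 3.0],
    9: [7.6, 5.0, 3.7, 3.8, 3.0, 3.2, 1.7, 1.4],
}


def first_differences(series):
    """Later range minus earlier range, rounded to one decimal place."""
    return [round(later - earlier, 1)
            for earlier, later in zip(series, series[1:])]


DIFFERENCES = {digit: first_differences(series) for digit, series in RATIOS.items()}
for digit, row in DIFFERENCES.items():
    print(digit, row)
```

Each row has seven entries, one for each pair of neighbouring decennial ranges. The jump of 8.0 for digit 5 between ranges 10-19 and 20-29 and of 10.6 for digit 0 between ranges 50-59 and 60-69 correspond to the sharp rises at ages 25 and 60 noted in the text.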


5.6. The differences for a number of consecutive end digits immediately before age 60 are all negative and comparatively larger, and the differences generally change sign in this area. The conditions in the immediate neighbourhood of age 25 are even less distinctive. Some age bias is however now suggested for age 16, which appears to have gained at the cost of ages 13-15. But the evidence is far from conclusive; and mounting concentrations at the preferred digits are no doubt mainly responsible for this lack of conclusiveness. And specific localised age bias is also relatively milder in India.

SECTION SIX

MEASURES OF CONCENTRATION AND DISTORTION


6.1. It is obvious that the extent of concentration at each digit and the 
resulting aggregate distortion have to be measured before any efficient set of grouping 
could be built up. The concentration at each digit can be measured by comparison 
of deviations of actual numbers returned at individual ages from the expected 
numbers in the corresponding ages obtained by graduation. It will be assumed 
however that no graduation of the age returns has been done, as the problem of grouping ceases to exist if a good unbiased graduation, uninfluenced by subjective choice of the operator, is already available; any set of grouping will be equally efficient when built up from such graduated numbers.


6.2. At one time, the total numbers returned at each end digit used to be 
compared to a tenth of the population, to get a measure of the integer bias and of the 
total distortion, on the assumption that all the end digits should occur with equal 
frequency if there was no integer bias. King (1916) then pointed out that the starting integers 0, 1, 2, ... got additional weightage in that order in such comparison[6.1]. The difficulty was resolved by Myers (1940) who used a blended population for the purpose in his index of concentration[6.2]; each digit was put successively at each place from first to tenth in the component populations by Myers, so that they got balanced
weight in the resulting blended population. Myers started with age 10 and showed 
that the average concentration values for the United Kingdom 1911 Census age 
returns yielded by his method were remarkably close to the values obtained by King 
by comparison of the numbers returned against the graduated numbers. 


6.3. A much simpler measure of concentration was evolved in connection with the analysis of distortion in NSS age returns. In a normal population structure, where the numbers alive gradually fall with age, digit ‘0’ will show up a higher concentration than really attached to it if the total numbers at different end digits of age were all compared to a tenth of the total population starting with age ‘0’. But this difficulty will be circumvented if for each different end digit of age, the population starting with that particular digit was only taken into account: thus, for digit ‘0’ the proportion of the number returned with end digit ‘0’ ages to the total population aged zero and above, for digit ‘1’ the proportion of the number returned with end digit ‘1’ ages to the population aged one and above, and so on, be taken. The proportions for different digits will not usually add up to unity under this method, but when reduced to a unit base the proportions should give proper relative measure of concentration for each of the individual digits.

6.4. The digit ‘0’ may still get a slight weightage under this method owing to the incidence of the high infant mortality: but that will not be material; and the practical consideration is there that under-reporting of infants, a common feature of censuses and surveys, tends to offset this.


[6.1] Journal of the Institute of Actuaries, Vol. XLIX, p. 301.
[6.2] Transactions of the Actuarial Society of America, Vol. XLI, Part 2, No. 104, pp. 402-415.

6.5. The measure of concentration suggested above and the index of aggregate distortion derived from it are defined below in algebraic symbols. If $n_{10v+u}$ denotes the number actually returned at age $10v+u$, then $m_u$, the measure of concentration at digit $u$, is given by

$$m_u = \frac{\sum_{v \ge 0} n_{10v+u}}{d_u} \times 10, \quad \text{where} \quad d_u = \sum_{w=u}^{9} n_{w} + \sum_{v \ge 1} \sum_{w=0}^{9} n_{10v+w},$$

the values being reduced to a common base so that they total 10 over the ten digits.

And $I$, the index of aggregate distortion, is given by

$$I = \sum_{u=0}^{9} |m_u - 1|.$$

It will be clear that $I$ will tend to zero in ideal conditions, if there were no distortion.
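The two definitions of para 6.5 translate into a few lines of code. The sketch below is one reading of them, under the assumption that the proportions for the ten digits are rescaled to total 10, as para 6.3 and the totals of Table (6.1) indicate; the function names and the input layout (a mapping from individual age to number returned) are choices made for this illustration.

```python
def concentration_measures(counts):
    """counts maps individual age -> number returned.
    For digit u, the number returned at ages ending in u is taken as a
    proportion of the population aged u and above; the ten proportions
    are then rescaled so that they total 10."""
    proportions = []
    for u in range(10):
        at_digit = sum(n for age, n in counts.items() if age % 10 == u)
        aged_u_and_above = sum(n for age, n in counts.items() if age >= u)
        proportions.append(at_digit / aged_u_and_above)
    scale = 10 / sum(proportions)
    return [p * scale for p in proportions]


def distortion_index(measures):
    """Index of aggregate distortion: sum over digits of |m_u - 1|."""
    return sum(abs(m - 1) for m in measures)


# Concentration row of Table (4.1), rural population aged 40-above,
# expressed on the base 10 instead of 100.
rural_40_above = [3.13, 0.43, 0.88, 0.47, 0.52, 2.27, 0.62, 0.51, 0.78, 0.39]
print(round(distortion_index(rural_40_above), 1))
```

Applied to the concentration row of Table (4.1), the index comes to 6.8, the figure shown in Table (6.1) for the rural population aged 40-above, which supports this reading of the definitions.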
6.6. The measure of concentration for each digit and the index of aggregate distortion of the NSS medium for rural India, urban India and the city of Calcutta are given in Table (6.1). The measure and index for the population aged 40-above in the rural sector, for males and females separately in the urban sector (where the sex differential was found highest), and for the population of household-income group Rs. 100 and below per month in the city of Calcutta have been actually presented in the table, to convey an idea as to how the concentrations vary from one population segment to another. The measures and index for Census 1951 in Uttar Pradesh individual age returns are also shown for comparison.

TABLE (6.1): MEASURES OF CONCENTRATION AT INDIVIDUAL END-DIGITS AND INDEX
OF AGGREGATE DISTORTION IN AGE RETURNS

(NSS 4th round 1952, Calcutta Employment Survey 1953, and Census of India 1951)

Columns:
(1)            end digit
(2), (3)       NSS 4th round, all-India rural (28,918 persons): all ages; aged 40-above
(4), (5), (6)  NSS 4th round, all-India urban (28,715 persons): male; female; all
(7), (8)       Calcutta Employment Survey (1,056 households): all household income groups; household income group Rs. 100 and below per month
(9)            Census 1951, U.P. (males)
1.6 LM 1.9 2.5 1.9 
0 1.8 $4 ad 0:7 0.7 0.6 0.6 е 
: 9 `9 1:1 1.2 13 1.2 Т d 
2 1.1 n 0.8 0.8 0.8 0.8 0.6 07 
i 0:8 0:5 0:9 0.9 0.8 0.9 0.7 0.8 
$ 0-8 MES 1.4 14 1:5 1.8 1:7 1 
5 1.6 0.6 0.9 0.9 0.9 Ns os 0.9 
$ йр 0.5 0.8 0.8 0.8 2$ б 0.6 
T 057 0.8 T.i 1.0 1.1 035 220 1.0 
5 i" 0:4 0.7 0.7 0.6 5 | 0.6 
total 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.0 
a ' R 
index of А F © 
a E р 2. : . 3. 
distortion 3.2 6.8 2.4 2.4 a 
6.3 ... of total sixteen part-samples of NSS 4th round were used for this note.
6.4 Census of India 1951, Paper No. 3, 1954, pp. 36-37.
6.7. It will be seen that the urban pattern of concentration differed from the rural pattern on the one hand and the city pattern on the other. The most efficient set of grouping for the urban age returns need not therefore be the most efficient for the rural and the city sectors. The fact that not only the pattern of concentration of the household-income group Rs. 100 and below per month of the city of Calcutta was of different nature from the general city pattern, but that the distortion in their age reporting was even greater than in the general rural population, is somewhat unexpected. This apparently results from the break-up of the family and the drift from original community moorings of individuals in the lower income group, and the consequent failure of the applicability of a relative seniority ranking scale within the household and the community. The higher average age of the city population, particularly in the lower income level, could also be a contributory cause. The implications of these differentials will be dealt with further in the next section.


6.8. It is easy to see from the distribution of numbers returned at individual ages (the diagrammatic representation of Chart (1) for example) that the force of concentration is comparatively low in the young age range; as could have been anticipated on a priori grounds, it gradually increases with advancing age. In the beginning, the pull of even over odd digits is most effective; then the pulls of ‘4’ and ‘6’ fade out, giving place to increased pull for the middle digit ‘5’. Pulls of ‘0’ and ‘5’ dominate the middle age range, with ‘0’ building up as age advances further. The position is complicated by the existence of special pulls of the nature of age bias.


6.9. While the mounting nature of the pull of concentration has been noticed by earlier operators, no relative measures of deviation for the different age ranges appear to have been used so far. The root mean square deviation in decennial age ranges, calculated as the square-root of the sum of the squared deviations between the numbers actually returned and the expected frequencies, can provide such measures⁶·⁵. The difficulty which perhaps weighed with the operators in this field was that of estimating the expected frequencies. But simple assumptions like linear fall in expected frequencies between quinary pivotal values (estimated from the numbers actually returned by suitable grouping) should serve the purpose. The set of expected frequencies produced by even such crude assumptions would take account of the general shape of the true distribution and thus give satisfactory relative measures. Only the measures of concentration, taken along with such relative range measures of deviation, can give a proper insight into the extent and spread of the distortion.


6.5 A number of other alternative measures of deviation are of course possible: one could be $\sum_{u=0}^{9} (n_{10v+u} - r_{10v+u})^2 / r_{10v+u}$, giving the $\chi^2$-analogue of the distribution in the decennial age range $10v$ to $10v+9$.
6.10. In algebraic symbols, $i_v$, the relative range measure of deviation of the decennial age range $10v$ to $10v+9$, is defined as

$$i_v = \frac{\sqrt{\sum_{u=0}^{9} (n_{10v+u} - r_{10v+u})^2}}{\sum_{u=0}^{9} n_{10v+u}},$$

where $r_{10v+u}$ is the expected number at age $10v+u$. The expected number $r_{10v+p}$ at pivotal age $10v+p$ ($p = 4, 9$ for the 2:7 grouping adopted) was taken as

$$r_{10v+p} = \frac{1}{5} \sum_{u=p-2}^{p+2} n_{10v+u};$$

and the expected numbers at other ages as

$$r_{10v+u} = r_{10v+p} - \frac{u-p}{5}\,(r_{10v+p} - r_{10v+p+5}), \qquad \text{where } u = p+1,\ p+2,\ \ldots,\ p+4.$$
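The computation described in 6.10 can be sketched as below. It is a minimal illustration, not the procedure actually programmed for the survey: the function names and the toy data are hypothetical, and the sketch assumes a decennial range with $v \ge 1$, since the text does not say how the expected numbers below the first pivotal age (age 4) were obtained.

```python
from math import sqrt


def pivot_value(counts, age):
    """Expected number at a pivotal age: mean of the five counts centred on it."""
    return sum(counts[age - 2: age + 3]) / 5


def expected_numbers(counts, v):
    """Expected numbers for ages 10v, ..., 10v+9 (needs v >= 1 and counts up to age 10v+11)."""
    r = {}
    for pivot in (10 * v - 1, 10 * v + 4):
        r_here, r_next = pivot_value(counts, pivot), pivot_value(counts, pivot + 5)
        for k in range(5):  # k = u - p; k = 0 is the pivotal age itself
            r[pivot + k] = r_here - k * (r_here - r_next) / 5
    r[10 * v + 9] = pivot_value(counts, 10 * v + 9)
    return [r[age] for age in range(10 * v, 10 * v + 10)]


def relative_range_deviation(counts, v):
    """i_v: root of the summed squared deviations divided by the numbers returned."""
    observed = counts[10 * v: 10 * v + 10]
    expected = expected_numbers(counts, v)
    squared = sum((n - r) ** 2 for n, r in zip(observed, expected))
    return sqrt(squared) / sum(observed)


smooth = [1000 - 5 * age for age in range(60)]  # invented, exactly linear returns
heaped = list(smooth)
heaped[30] += 100                               # shift 100 returns from age 31 to age 30
heaped[31] -= 100
print(relative_range_deviation(smooth, 3), relative_range_deviation(heaped, 3))
```

With exactly linear returns the expected and actual numbers coincide and the measure is zero; heaping on a single age makes it positive.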


6.11. The relative range measures of deviation in the successive decennial age ranges for the population returned at individual ages in NSS for rural sector are given in Table (6.2). The progressive increase in the measures of deviation with advance of age is clearly brought out in columns (2) and (3) of the table. The root mean square deviations per cent of age (obtained by dividing the deviation per individual by the middle age of the range for simplicity and multiplying by 100) are also shown in the table: the interesting fact that the average deviation per unit of age attained is nearly uniform in all age ranges emerges from this part of the table.


TABLE (6.2): RELATIVE RANGE MEASURES OF DEVIATION IN DECENNIAL AGE RANGES

(NSS 4th round 1952, all-India rural samples: 28,918 persons)

age range     deviation per individual     deviation per cent of age
              male      female             male      female
(1)           (2)       (3)                (4)       (5)

 0-9          0.03      0.02               0.61      0.43
10-19         0.09      0.09               0.56      0.58
20-29         0.15      0.15               0.61      0.59
30-39         0.25      0.19               0.70      0.54
40-49         0.29      0.25               0.63      0.56
50-59         0.30      0.33               0.54      0.60
60-69         0.37      0.50               0.57      0.76
70-79         0.42      0.41               0.57      0.54
80-89         0.36      0.58               0.42      0.69
90-99         0.56      0.48               0.59      0.51
SECTION SEVEN
GROUPING EFFICIENCY


7.1. The efficiency of a set of grouping is traditionally assessed by the difference between the sum of the actual concentrations at the end digits comprised in the group and the ideal values. $E_{0:5}$, the efficiency index of the quinary set of grouping 0:5 for example, is given by $\sum_{u=0}^{4} m_u - 5$: it is obvious that the complementary value $\sum_{u=5}^{9} m_u - 5$ will be the same, with just the sign reversed. This method was not altogether satisfactory in that the weight of numbers lay in the age range below 20 in population formations like India, and the group efficiency was accordingly determined to a greater extent by the pattern of concentration in this young age range: the chosen set of groups are however used throughout the span of life, and also for different socio-economic segments of the population, where the patterns of concentration are different.
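The traditional index of 7.1 is easily computed from the ten measures of concentration. The sketch below assumes, by analogy with the 0:5 case, that the index of a set such as 2:7 is the sum of the measures at end digits 2 to 6 less 5; the function name and the values of `m` are hypothetical illustrations, not figures from the survey.

```python
def group_efficiency(m, start):
    """Efficiency index of the quinary group of end digits start, ..., start+4 (mod 10)."""
    return sum(m[(start + j) % 10] for j in range(5)) - 5


# Invented measures of concentration; they add up to 10 as required.
m = [1.8, 0.7, 1.1, 0.8, 0.8, 1.6, 0.9, 0.8, 1.0, 0.5]
for start in range(5):
    print(f"{start}:{start + 5}", round(group_efficiency(m, start), 2))
```

Because the measures add up to 10, the index of the complementary group is always the same with the sign reversed, which is why only five sets need to be compared.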


7.2. It has been seen earlier how the distortions are comparatively small 
in the lower age range up to 20 and gradually increase with age. The relative effi- 
ciency of a set of grouping in the higher age ranges should thus provide a better 
indicator. It was therefore proper to decide on the most efficient set of grouping from 
a study of the behaviour of the various possible sets in different age ranges; the study 
should preferably extend to other socio-economic segments of the population. 
Table (7.1) gives the efficiency indices of the various possible sets of grouping in some 
important population segments. 


TABLE (7.1): GROUP EFFICIENCY INDEX OF DIFFERENT SETS OF GROUPING

(NSS 4th round 1952, Calcutta Employment Survey 1953, and Census 1951)

Columns:
(1)            set of grouping
(2), (3), (4)  NSS 4th round, all-India rural (28,918 persons): all ages; aged 30-above; aged 40-above
(5), (6)       NSS 4th round, all-India urban (28,715 persons): male; female
(7), (8)       Calcutta Employment Survey (1,056 households): all; household income < Rs. 100 p.m.
(9)            Census 1951, U.P. (males)

(1)         (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)
1. 0:5      0.15    0.31    0.43    0.14    0.11    0.35    0.46    0.14
2. 1:6     -0.10   -0.36   -0.44   -0.01   -0.16   -0.07   -0.27   -0.05
3. 2:7      0.20   -0.06   -0.24    0.21    0.07    0.17   -0.01    0.16
4. 3:8     -0.24   -0.52   -0.62   -0.21   -0.27   -0.96   -0.49   -0.26
5. 4:9     -0.08   -0.18   -0.31    0.08    0.09    0.00   -0.1     0.01




NATIONAL SAMPLE SURVEY : NOTES ON AGE GROUPING 


7.3. Table (7.1) demonstrates how the relative variations in the group efficiencies spread out in higher age segments of the same population. The urban and Calcutta samples were not analysed further in age segments, but the Calcutta sample was examined in an economic segment; in the household income < Rs. 100 p.m. population segment the variation again scattered wider. The group efficiency indices of the Census 1951 U.P. individual age distribution (1% sample)⁷·¹ are also shown in the table for comparison.
7.4. The rural sector is by far the more important; but on the basis of all ages there was not much to choose between the different sets of grouping, though the 4:9 and 1:6 sets had low indices. It has been pointed out earlier why the all ages index was not satisfactory. The 2:7 set shows the definite minimum indices at the higher age segments: it similarly has a definite minimum index in the lower income group, which again was by far the more important economic segment. The real test of efficiency is thus satisfied by the 2:7 set for which the index remains more stable and comparatively low in all the different segments, particularly where the indices for the other sets soar up: the index for this set also shows more changes in sign in passing through the population segments, which make for more balanced distribution of the group errors between it and its complementary group.


7.5. The 2:7 set of grouping was therefore adopted in analysis, interpretation and presentation of NSS data on age (as well as duration). With a general tendency to over-estimate, the ages with the end digits ‘0’ and ‘5’ were likely to draw comparatively more from the ages below, and the 2:7 set of grouping, with the maximum concentration digits ‘0’ and ‘5’ placed towards the end of the groups, was also efficiently constituted to take account of this aspect of the estimation error.

7.6. Though the question of the most efficient set of grouping in the Indian Census was not in issue, examination and some discussion of the age grouping in the Census medium became inevitable. The mediums of collection of information and the types of errors are different for the Census and the NSS; the population is, however, the same and the relative efficiencies of the various sets of grouping could at least be expected to remain undisturbed as between them.


7.7. In Census 1931 Report⁷·² the 2:7 grouping was recommended after analysis of the age data in individual years on traditional lines. No detailed examination was done of Census 1941 age data. Numbers returned at individual ages in Census 1951 were not available for all India. Analysis of concentration and of group efficiency of the Census material for Uttar Pradesh (U.P.), the only State which formed a Census population zone by itself, is shown in Tables (6.1) and (7.1): in Census 1951 Age Tables⁷·³ some detailed examination of the age data of the same State was done to decide about the most efficient set of grouping. In the Age Tables the 2:7 set has been described as the standard grouping, but the 3:8 set has been recommended in the same breath as the ‘proper’ set; the belief that the dominant digits ‘0’ and ‘5’ should draw nearly equally from either side apparently tipped the scales in favour of the 3:8 set.


7.1 Census of India 1951, Paper No. 3, 1954, pp. 36-37.
7.2 Census of India 1931, Vol. I, Part 1, p. 135.
7.3 Census of India 1951, Paper No. 3, 1954, p. 2


7.8. Primary grouping of the Census age returns in the 3:8 set however produced a saw-tooth distribution and a roller-type formula had to be applied to these quinary group values to get a smooth run. This was achieved by taking the weighted average of the group frequency concerned and the two adjoining group frequencies. The smooth set of group frequencies ultimately operated on for graduation purposes in Census 1951 thus rested on the assumption that the dominant digits drew from 7 individual ages on either side: such an assumption looks stretched on the face of it. Table (7.1) showed clearly how the 3:8 set is the least efficient for the Indian situation. The problem of grouping efficiency exists even behind the operation of graduation: good graduation can only flow from an efficient set of group pivotal values.


7.9. It is interesting to note that if the numbers returned in individual ages in Census 1951 are grouped under the 2:7 set, the total deviations of the actual group frequencies from the expected (built up from the corresponding graduated individual frequencies) are smaller than similar total deviations of actual from expected under the 3:8 set of grouping; this was actually verified for the individual age distributions of Uttar Pradesh and Madras, special notice of which was taken in the Census 1951 Age Tables⁷·⁴ to select the ‘proper’ grouping. When it is realised that the graduation itself is based on the 3:8 set, greater confidence is gained about the superiority of the 2:7 set.


7.10. Table (7.2) gives the comparative deviations of the actual numbers returned from the expected for the two competing sets of grouping 2:7 and 3:8 for the Census 1951 (males) population of Uttar Pradesh and Madras⁷·⁴.


7.11. The problem of grouping has been dealt with so far with reference to age returns in individual years. But there may be situations when collection in individual years is either not possible or not advisable. The considerations guiding the selection of the most efficient groups will be altogether different if ages are as a rule not known or cannot be estimated in individual years. An ingenious suggestion made by R. Bachi (1951) to meet a somewhat similar situation deserves special mention in this context. He advised that all series which could not be collected by individual years might be collected for individual end digits ‘0’ and ‘5’ only, and for part-groups of end digits 1-4 and 6-9, as dominant distortions will be disclosed thereby⁷·⁵. Bachi further suggests routine methods of allocating the numbers returned at each of the preferred end digits ‘0’ and ‘5’ to the two bordering part-groups: allocation in equal parts, or in proportion to the weights of the part-groups, or in inverse proportion to the relative broad aggregate measures of concentration of the part-groups (with adjustments for declining numbers with age) are the alternatives proposed.


7.4 Census of India 1951, Paper No. 3, 1954, pp. 36-37 and 68-69.
7.5 Bulletin of the International Statistical Institute, 1951, Vol. XXXIII, Part IV, pp. 218-221.
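The three alternative allocations attributed to Bachi above can be illustrated with a small sketch. The function `allocate`, its parameter names and the figures are all hypothetical; the adjustment for declining numbers with age mentioned in the text is not attempted here.

```python
def allocate(preferred, below, above, rule="equal", conc_below=1.0, conc_above=1.0):
    """Split the number returned at a preferred age between its two bordering part-groups.

    preferred: number returned at the age ending in '0' or '5'
    below, above: numbers returned in the part-groups just below and just above
    conc_below, conc_above: broad measures of concentration of the two part-groups
    """
    if rule == "equal":
        share_below = 0.5
    elif rule == "weights":
        share_below = below / (below + above)
    elif rule == "inverse_concentration":
        share_below = (1 / conc_below) / (1 / conc_below + 1 / conc_above)
    else:
        raise ValueError(f"unknown rule: {rule}")
    to_below = preferred * share_below
    return below + to_below, above + (preferred - to_below)


print(allocate(100, 30, 10))             # equal parts
print(allocate(100, 30, 10, "weights"))  # in proportion to the part-group numbers
```

Whatever the rule, the total of the two adjusted part-groups equals the total originally returned, so the allocation redistributes numbers without creating or losing any.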


TABLE (7.2): COMPARATIVE DEVIATIONS BETWEEN CENSUS NUMBERS RETURNED AND
EXPECTED UNDER DIFFERENT SETS OF GROUPING


grouping set 2:7                                   grouping set 3:8
age      number    number    deviation             age      number    number    deviation
group    returned  expected  (2)-(3)               group    returned  expected  (2)-(3)
(years)  (000)     (000)     (000)                 (years)  (000)     (000)     (000)
(1)      (2)       (3)       (4)                   (1)      (2)       (3)       (4)

Uttar Pradesh (males): Census 1951

+ = ги = 

15 96 244 4241 3 

2. 64 3971 482 
3. 215 3505 354 
4. 410 3071 90 
5. 156 2745 114 

6. 75 2497 212 
7. 124 2238 163 

8. 37—41 1986 40 1908... 146 
9. 1737 27 1710 : 157 

10. 1538 4s aou ES Cn 

11 1084 98 1117 156 

12. 908 37 809 113 
13. 492 90 530 104 
l4. total 30581 740 740 total 29829 1138 1138 
average percentage deviation 4.84                  average percentage deviation 7.63


Madras (males): Census 1951


Te PE 57, 3607 3621 14 
1. 3701 3644 ШАА ү; $—12 3618 3339 270 
2 sao X 13—17 2040 3093 144 
3. 3541 oe 306 18—22 2613 2724 nm 
4. . 2500 28 67 23—21 2326 2366 40 
5. 2500 2 1 28—32 2233 2118 15 
6. 2172 2165 17 33—37 1793 1927 134 
2 1951 1968 49 38—42 1885 1730 155 
8. 1823 1114 13 43—47 1360 1506 146 
0; 1482 1555 85 48—52 1429 1269 160 
10. 1408 1320 116 53—57 857 1007 150 
11. 946 1062 58 58—62 871 151 120 
в та 78 63—67 400 4% m 
DM 4 
14. total 20632 96632 709 709 Tk Nue фын Nc a, oe 
. total 20602 aes a 
average percentage deviation 5.32                  average percentage deviation 6.39


7.12. But the alternative allocations suggested by Bachi did not take account of the bias to over-estimate. Collection of age returns in individual years is possible in the situation contemplated by him, and small sample analysis does not mean much extra cost or time. A sample of the population (or of the Census slips) can indicate the estimation and other biases and should yield more accurate estimates of concentration and group efficiency. The most efficient set of grouping could be determined in this manner, in advance of the general tabulation, which may be done straightaway after that in the set of grouping adjudged most efficient.
APPENDIX 0
PROFORMA OF SCHEDULE USED IN THE ESTIMATES AND ESP STUDY


ESTIMATION AND ESP STUDY


The Demography Unit will be very much obliged if you kindly fill up the particulars below in your spare time and return the sheet to the Unit (Sri Samarendra Nath Mitra) early.

БОШ NO CNN OPER. Па ees valise ann 
I. In terms of ‘L’, a new unit of linear measurement specified below, eye-estimate the lengths of the following five lines to two places of decimals and record the estimates:

[Figure: a line marked L, the new standard unit, followed by five lines numbered (1) to (5), each with a space for the estimated length (0.00).]


II. The middle 3 digits of the following seven digited numbers are missing: please complete the numbers by filling up the middle blank space with the digits that you think, on your first guess, might have been there:

(1) 93 ___ 85
(2) 27 ___ 47
(3) 45 ___ 90
(4) 12 ___ 08
(5) 66 ___ 19


Thank you!


TO 


19 6 $5 
(Ajit Das Gupta) 
APPENDIX 1
DETAILED TABLES 


TABLE 1(1): DISTRIBUTION OF INDIVIDUALS IN AGE-ASSESSED MINUS AGE-STATED
CLASSES BY RELIGION

(NSS, WBSD Study 1954)

age-assessed minus               city                   other urban                  rural
age-stated class           (170 households)          (405 households)          (754 households)
                          total  Hindu  others      total  Hindu  others      total  Hindu  others
(1)                        (2)    (3)    (4)         (5)    (6)    (7)         (8)    (9)    (10)

1. age-assessed
   = age-stated            495    422     73        1552   1296    256        2733   2160    573
   (%)                   (83.9) (86.5) (71.6)      (86.3) (87.0) (82.3)      (77.6) (80.4) (68.5)
2. age-assessed
   < age-stated             20     15      5          70     53     17         274    184     90
   (%)                    (3.4)  (3.1)  (4.9)       (3.9)  (3.6)  (5.5)       (7.8)  (6.9) (10.8)
3. age-assessed
   > age-stated             75     51     24         178    140     38         515    342    173
   (%)                   (12.7) (10.4) (23.5)       (9.9)  (9.4) (12.2)      (14.6) (12.7) (20.7)
4. total                   590    488    102        1800   1489    311        3522   2686    836
   (%)                  (100.0)(100.0)(100.0)     (100.0)(100.0)(100.0)     (100.0)(100.0)(100.0)


TABLE 1(2): DISTRIBUTION OF INDIVIDUALS IN AGE-ASSESSED MINUS AGE-STATED
CLASSES BY MOTHER TONGUE

(NSS, WBSD Study 1954)

age-assessed minus               city                   other urban                  rural
age-stated class           (170 households)          (405 households)          (754 households)
                          Bengali Hindi others      Bengali Hindi others      Bengali Hindi others
(1)                        (2)    (3)    (4)         (5)    (6)    (7)         (8)    (9)    (10)
1, goa “т: 5 "ETE NW Ese 
70 72 58 112 Д 
9 мело (94:9) (75.8) (898) (91.2) (689) (88.6) (78.1) (70.0) — (08.1) 
0 
2. адо-аззевзей 78 > а ; aa an Е эв К i 
c зу a2 00D $0) (т) Q9 (7.5) (12.5) (13.4) 
А 
3. _age-assesser 51 19 5 7 9 17 479 т 72:0 
| > ae qr) eoo 65) (er) мо E9 — (040 (070) (185) 
| : 436 95 59 1238 379 193 3325 40 157 
9. WEN (100.0) (100.0) (100.0) фо (100.0) (200.0) (200.0) (1000) qon0) 
0 
TABLE 1(3): FREQUENCY DISTRIBUTION OF THE NUMBER RETURNED AT EACH
INDIVIDUAL AGE BY SEX


(NSS 4th round 1952, all-India urban sample)


age      male  female  persons  |  age      male  female  persons
(years)                         |  (years)
(1)      (2)   (3)     (4)      |  (1)      (2)   (3)     (4)
E 394 372 766 | 45 319 252 571 
1 368 389 757 | 46 oTi 65 142 
E 460 389 849 | 47 71 52 123 
3 378 375 753 | 48 128 101 229 
4 391 348 7399. | 49 75 47 122 
5 361 350 711 250 327 577 
6 350 359 709 54 38 92 
7 346 337 683 132 85 217 
8 349 367 716 49 49 98 
9 337 239 626 64 38 102 
10 393 395 788 | 170 189 359 
1 283 287 510 | 60 70 130 
12 436 416 852 | 48 31 85 
13 283 263 546 | 53 48 101 
14 358 326 684 25 17 42 
15 297 283 580 60 203 207 410 
16 353 319, 672 61 20 31 51 
17 246 241 487 62 4l 44 85 
18 391 396 787 | 63 23 18 41 
19 222 201 43 | 64 28 21 49 
20 399 445 844 65 84 110 194 
21 23 161 384 66 19 15 34 
22 349 702 67 27 20 47 
23 149 373 68 16 22 38 
24 216 їп | 69 12 8 20 
25 401 418 819 70 80 100 180 
26 261 227 488 | 71 1 8 15 
27 195 168 363 72 17 12 29 
28 . 288 254 542 73 и 15 26 
29 118 123 241 | 14 9 5 14 
| 
30 487 429 916 | 75 28 40 j 
31 123 98 221 76 2 7 % 
32 276 202 478 | T 6 1 7 
33 135 101 236 78 4 5 9 
34 150 95 245 | 79 1 5 6 
35 410 80 20 Р 
36 155 81 4 E: gi 
37 115 82 6 7 13 
38 174 83 =, 1 1 
39 88 | 84 1 6 7 
40 378 378 756 85 5 
41 82 51 133 86 B 1 " 
42 149 121 270 87 = = = 
43 86 73 159 88 1 " * 1 
44 81 88 169 89 2 E 2 
MISCELLANEOUS


A PARTIAL ORDER AND ITS APPLICATIONS TO PROBABILITY THEORY


By T. V. NARAYANA
Institut H. Poincaré and McGill University


SUMMARY. Application of a result in combinatory analysis, which generalizes the “problème du scrutin” of D. André, to a partial order defined on the partitions of an integer. Relations of this partial order with two problems in probability theory.


1. DEFINITION OF r-PARTITION OF n


Given an integer $n$, we define an $r$-partition of $n$ as follows: An $r$-partition of $n$, ($r \ge 1$), is a set of $t_i$ where $t_i \ge 1$ is an integer for $i = 1, \ldots, r$ such that

$$t_1 + \ldots + t_r = n.$$

We remark that, in general, we shall consider $(t_1, t_2, \ldots, t_r)$, $(t'_1, \ldots, t'_r)$, where $t_1 + t_2 + \ldots + t_r = t'_1 + \ldots + t'_r = n$, as distinct $r$-partitions of $n$, unless $t_i = t'_i$ ($i = 1, \ldots, r$). If $r$ is an integer such that $1 \le r \le n$, we have, obviously, $\binom{n-1}{r-1}$ distinct $r$-partitions of $n$.


2. PARTIAL ORDERING OF THE r-PARTITIONS OF n


Given any two integers $n$, $r$ ($1 \le r \le n$), let us consider all the $\binom{n-1}{r-1}$ $r$-partitions of $n$. We shall say that an $r$-partition of $n$

$$(t_1, \ldots, t_r)$$

“dominates” the $r$-partition of $n$ $(t'_1, \ldots, t'_r)$ if and only if

$$t_1 \ge t'_1, \qquad t_1 + t_2 \ge t'_1 + t'_2, \qquad \ldots, \qquad t_1 + \ldots + t_{r-1} \ge t'_1 + \ldots + t'_{r-1}. \tag{2.1}$$

Evidently $t_1 + \ldots + t_r = t'_1 + \ldots + t'_r = n$.

The relation of domination defined by (2.1) is transitive and anti-symmetric. It thus represents a partial ordering of the $r$-partitions of $n$. More generally, if $(t_1, \ldots, t_r)$ is an $r$-partition of $m$, and $(t'_1, \ldots, t'_r)$ is an $r$-partition of $n$, where $m > n$, we say that $(t_1, \ldots, t_r)$ dominates $(t'_1, \ldots, t'_r)$ if the relations (2.1) are satisfied.

Let us suppose we number the $\binom{n-1}{r-1}$ $r$-partitions of $n$, taken in some order, using the symbols $p_1, p_2, \ldots, p_{\binom{n-1}{r-1}}$. Taking the partition $p_i$, let $d_i$ be the number of partitions dominated by $p_i$ in the set $p_1, p_2, \ldots$ $\left(i = 1, 2, \ldots, \binom{n-1}{r-1}\right)$. The total

$$(n, r) = \sum_{i} d_i$$

obviously does not depend upon the particular ordering chosen for numbering the $r$-partitions of $n$. We state, as Lemma 1, the following result:


Lemma 1: For all integers $n$, $r$

$$(n, r) = \binom{n}{r} \binom{n}{r-1} \Big/ n.$$


We shall demonstrate Lemma 1 in section 3 using a geometric interpretation of the $r$-partitions of $n$.
3. A COMBINATORIAL PROBLEM 


Suppose we have a particle at the origin of a Euclidean space of $k$ dimensions ($k$ finite) and consider $k$ mutually perpendicular axes. We shall be interested in points like $(a_1, a_2, \ldots, a_k)$ where $a_1 \ge a_2 \ge \ldots \ge a_k \ge 1$ are all integers. We shall suppose that the particle can move on the network consisting of the points $p$ following the rules given below:

Let $a_{i\alpha}$ be the increase in the $i$-th coordinate at the $\alpha$-th step.

1) $a_{i\alpha} \ge 1$ for all $i$, $\alpha$ ($i = 1, \ldots, k$; $\alpha \ge 1$), i.e., at each step and following each axis the particle moves at least one unit.

2) $a_{11} \ge a_{21} \ge \ldots \ge a_{k1} \ge 1$;
$a_{11} + a_{12} \ge a_{21} + a_{22} \ge \ldots \ge a_{k1} + a_{k2} \ge 2$;
and, in general, for the $\alpha$-th step

$$\sum_{j=1}^{\alpha} a_{ij} \ge \sum_{j=1}^{\alpha} a_{i'j} \ (\ge \alpha) \qquad \text{when } i \le i';\ i, i' = 1, \ldots, k.$$

Let us suppose that we know that at the $r$-th step the particle has reached the point $(a_1, a_2, \ldots, a_k)$, $a_1 \ge a_2 \ge \ldots \ge a_k \ge r$, and let $(a_1, \ldots, a_k)_r$ be the total number of different ways by which the particle can arrive at this point (or the total number of possible paths).


Theorem 1: We have:

$$(a_1, \ldots, a_k)_r = \begin{vmatrix}
(a_1 - 1)_{(r-1)} & (a_2 - 1)_{(r)} & \cdots & (a_k - 1)_{(r+k-2)} \\
(a_1 - 1)_{(r-2)} & (a_2 - 1)_{(r-1)} & \cdots & (a_k - 1)_{(r+k-3)} \\
\vdots & \vdots & & \vdots \\
(a_1 - 1)_{(r-k)} & (a_2 - 1)_{(r-k+1)} & \cdots & (a_k - 1)_{(r-1)}
\end{vmatrix}$$

where $(a_i - 1)_{(u)} = \binom{a_i - 1}{u}$.


This theorem generalizes the André-Poincaré “problème du scrutin”; H. Poincaré (1913), E. Borel (1925).
Demonstration: We shall prove the theorem for $k = 2$. (The general proof will be given as a special case in a forthcoming paper of the author.) The network now consists of the points $p(a_1, a_2)$ where $a_1 \ge a_2 \ge 1$.

Evidently $(a_1, a_2)_1 = 1$ for $a_1 \ge a_2 \ge 1$, and $(a_1, a_2)_2 = 0$ if $a_2 < 2$.

If $a_1 \ge a_2 \ge 2$,

$$(a_1, a_2)_2 = \sum (a'_1, a'_2)_1$$

where the summation extends over the points $p$ contained in the truncated rectangle formed by the origin and the point $(a_1, a_2)$ as shown.

[Figure: the truncated rectangle formed by the origin and the point $(a_1, a_2)$.]

Thus

$$(a_1, a_2)_2 = \sum_{a'_2 = 1}^{a_2 - 1} \ \sum_{a'_1 = a'_2}^{a_1 - 1} 1 = (a_1 - 1)(a_2 - 1) - (a_2 - 1)_{(2)}.$$

We can prove easily, by induction, that:

$$(a_1, a_2)_r = (a_1 - 1)_{(r-1)} (a_2 - 1)_{(r-1)} - (a_1 - 1)_{(r-2)} (a_2 - 1)_{(r)}.$$

Set $a_1 = a_2 = n$. Each path represents a domination of an $r$-partition of $n$ by an $r$-partition of $n$; and, inversely, to each domination of an $r$-partition of $n$ by an $r$-partition of $n$, corresponds a path. We give below an example with $n = 5$, $r = 3$: $(3, 1, 1)$ dominates $(2, 1, 2)$.

[Figure: the path corresponding to the domination of $(2, 1, 2)$ by $(3, 1, 1)$.]

We thus obtain the result of Lemma 1; for:

$$(n, r) = (a_1, a_2)_r \ \text{ when } a_1 = a_2 = n, \qquad \text{or} \qquad (n, r) = \binom{n}{r} \binom{n}{r-1} \Big/ n.$$

We can deduce the problème du scrutin of Désiré André from this result.
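As a numerical check on this last step, the $k = 2$ formula may be evaluated at $a_1 = a_2 = 5$, $r = 3$ and compared with Lemma 1. The instance below is an added illustration; it uses only the notation $(a)_{(u)} = \binom{a}{u}$ already introduced.

$$(5, 5)_3 = (4)_{(2)} (4)_{(2)} - (4)_{(1)} (4)_{(3)} = 6 \cdot 6 - 4 \cdot 4 = 20, \qquad \binom{5}{3} \binom{5}{2} \Big/ 5 = \frac{10 \cdot 10}{5} = 20.$$

Both expressions give 20, so among the six 3-partitions of 5 there are 20 dominations, each partition being counted as dominating itself.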
4. AN EXTENSION OF THE “PROBLÈME DU SCRUTIN”


The ‘problème du scrutin’ is as follows: “Two candidates A and B stand for an election. A well-informed observer knows beforehand that A will obtain $m$ votes and B $n$ votes, where $m > n$. What is the probability that A will lead B throughout the scrutiny of the votes?”


Since A leads B throughout the scrutiny, equality of votes not being allowed, we might reword the problem as follows: What is the probability that A holds a 1-lead over B throughout the scrutiny? We solve below the more general question: What is the probability that A holds an $L$-lead over B, $L$ being an integer so that $1 \le L \le m - n$?


Solution: Given $1 \le L \le m - n - 1$, A could hold the $L$-lead over B in the following mutually exclusive ways:

The last vote is for B.
The last vote is for A; but the last but one is for B.
The last two votes are for A; the preceding one is for B.
...
The last $(m - L - n)$ votes are for A; the preceding one is for B.


Let us consider the case where the last vote is for B. Evidently for A to hold the $L$-lead over B, the first $L$ votes (at least) have to be for A. We now remark that there is a one-to-one correspondence between each domination of an $r$-partition of $n$ (the votes for B) by an $r$-partition of $m - L$ (the remaining votes for A) and sequences of $m$ A's and $n$ B's where A holds the $L$-lead over B. Since

$$(m - L, n)_r = (m - L - 1)_{(r-1)} (n - 1)_{(r-1)} - (m - L - 1)_{(r-2)} (n - 1)_{(r)},$$

the number of ways in which A holds the $L$-lead over B, the last vote being for B, is

$$\sum_{r=1}^{n} (m - L, n)_r = (m + n - L - 2)_{(n-1)} - (m + n - L - 2)_{(n-3)}.$$

The case where the last vote is for A (the last but one being for B) gives rise, similarly, to

$$\sum_{r=1}^{n} (m - L - 1, n)_r = (m + n - L - 3)_{(n-1)} - (m + n - L - 3)_{(n-3)}$$

ways in which A holds the $L$-lead over B.

If just the last $(m - L - n)$ votes are for A, we have similarly

$$\sum_{r=1}^{n} (n, n)_r = (2n - 2)_{(n-1)} - (2n - 2)_{(n-3)}.$$

The total number of ways in which A holds the $L$-lead over B is

$$(m + n - L - 1)_{(n)} - (m + n - L - 1)_{(n-2)}.$$

The probability of holding the $L$-lead is thus

$$\frac{m!\,(m + n - L)!\,(m - L - n + 1)}{(m + n)!\,(m - L + 1)!}$$

which reduces to $\dfrac{m - n}{m + n}$ for $L = 1$.
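The closed form for the probability of holding the $L$-lead can be compared with a direct enumeration of all orders of counting. The sketch below is an illustration under an explicit assumption about the definition, inferred from the remark that the first $L$ votes must be for A: A is taken to hold the $L$-lead when the first $L$ votes are all for A and A's lead never falls below $L$ afterwards. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def holds_lead(votes, L):
    """True if the first L votes are for A and the lead stays at L or more afterwards."""
    lead = 0
    for counted, vote in enumerate(votes, start=1):
        lead += 1 if vote == "A" else -1
        if lead < min(counted, L):
            return False
    return True


def enumerated_probability(m, n, L):
    """Proportion of the equally likely counting orders in which A holds the L-lead."""
    favourable = total = 0
    for b_places in combinations(range(m + n), n):
        votes = ["A"] * (m + n)
        for place in b_places:
            votes[place] = "B"
        total += 1
        favourable += holds_lead(votes, L)
    return Fraction(favourable, total)


def formula_probability(m, n, L):
    """The closed form given in the text."""
    return Fraction(factorial(m) * factorial(m + n - L) * (m - L - n + 1),
                    factorial(m + n) * factorial(m - L + 1))


print(enumerated_probability(5, 2, 2), formula_probability(5, 2, 2))
```

Exact fractions are used so that the two values can be compared without rounding.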




The probability of holding the $L$-lead and no better is evidently

$$\frac{m!\,(m + n - L - 1)!\,n\,(m - L + 2 - n)}{(m + n)!\,(m - L + 1)!}.$$

For $L = m - n$, it is easy to see that the number of ways in which A holds the $L$-lead over B is

$$(2n - 2)_{(n-1)} - (2n - 2)_{(n-3)},$$

a result independent of $m$.


We remark in passing that in order that A holds the 1-lead over B with probability one-half, it is necessary that $m = 3n$. A simple calculation shows that, when $m$ is large, $m$ should equal $(2 + \sqrt{5})n$ or $4.24n$ approximately, if A should hold the 2-lead over B with the same probability.


5. A PROBLEM IN PROBABILITY THEORY 


Let us suppose that we are given two coins 1, 2 with probabilities $p_1$, $p_2$ of obtaining heads, and consequently the probabilities $q_1$, $q_2$ of obtaining tails where $q_i = 1 - p_i$, $i = 1, 2$. We shall assume in what follows that $p_1 + p_2 > 1$.

Let us consider the game $G_2$ played with the following rules (Narayana, 1955):

1) The first trial is made with coin 1.

2) For $n > 1$, the $n$-th trial is made with coin 1 or coin 2, according as the result of the $(n-1)$st trial was a tail or head.

3) We stop the series of trials at that trial where for the first time the accumulated number of heads obtained (with both coins) is greater than the accumulated number of tails obtained by exactly 2.

Since we have assumed $p_1 + p_2 > 1$, the probability that our game $G_2$ will terminate in a finite number of trials approaches unity as closely as we please.

It is evident that the game $G_2$ can end only at the $(2n+2)$nd trial, $n = 0, 1, \ldots$, and that the last, i.e., $(2n+2)$nd trial is made with coin 2. We shall use a simple, self-evident notation to represent a sequence of trials, letting x represent a head and o a tail.

The sequence given below

x o o o x x x x   (trials made with coins 1, 2, 1, 1, 1, 2, 2, 2 respectively)

indicates that the game $G_2$ is terminated at the 8th trial and that:

The 1st trial was a head with coin 1.
The 2nd trial was a tail with coin 2.
The 3rd trial was a tail with coin 1, and so on.

We shall say that a sequence of trials representing $G_2$ belongs to the series $S_n$, $n \ge 0$ being an integer or zero, if we obtain $n$ o's with coin 1 during the sequence. For example, the sequence above, since it contains two tails with coin 1 (3rd and 4th trials), belongs to $S_2$. We thus classify a sequence belonging to $G_2$ according as the number of o's obtained with coin 1 in this sequence, into mutually exclusive classes $S_0, S_1, S_2, \ldots, S_n, \ldots$. Any sequence in $G_2$ necessarily belongs to one and only one of these series. We shall study the first few of these series to define the concept of “base” sequence.


Series $S_0$: Since any sequence of $G_2$ belonging to $S_0$ contains no o's with coin 1, it is clear, after some consideration, that $S_0$ contains sequences of the type:

[Diagrams of the first three sequences of $S_0$ not reproduced.]

and in general, a sequence containing $(2n+2)$ trials and belonging to $S_0$ can occur only in one way, viz.:

[Diagram not reproduced.]

i.e., $n$ patterns of the type xo followed by xx, which terminates the event. We call xx a "base" sequence for $S_0$ for reasons to be obvious shortly. Any term of $S_0$ can be obtained from the base sequence by adding an appropriate number of "subsidiary patterns" xo suitably, i.e., so as not to contradict the rules of $G_2$.


Series $S_1$: A little consideration will show that there exists one and only one base sequence for $S_1$ from which every term of $S_1$ can be obtained by adding an appropriate number of "subsidiary patterns" of the type xo or ox suitably.

The base sequence is given by:

[Diagram of the base sequence not reproduced.]

and we can add xo instead of any line indicated by \ and ox instead of any line indicated by /, in the base sequence as shown below:


Thus, for example, if we would like to obtain all the sequences of $G_2$ belonging to $S_1$ containing 6 terms we need only add xo or ox suitably, giving the sequences

[Diagrams of the three sequences (i), (ii), (iii) not reproduced.]

The problem is the same as that of putting 1 ball in 3 boxes, the boxes being indicated by the sloping lines in the base sequence. In general, if a sequence of $S_1$ contains $(2n+4)$ terms, $n = 0, 1, 2, \ldots$, this sequence can occur in the same number of ways as that of putting $n$ balls in 3 boxes, or in $\binom{n+2}{2}$ ways.
Series $S_2$: We shall state that there are 2 base sequences for the series $S_2$, viz.,

[Diagrams of the two base sequences not reproduced.]

The number of "boxes" is 5, indicated by the sloping lines, and if a sequence of $S_2$ consists of $(2n+6)$ terms, $n = 0, 1, 2, \ldots$, this sequence can occur in

[expression not recoverable]

ways.


6. DEFINITION OF BASE SEQUENCES 


Given the series $S_n$, i.e., the total number $n$ of tails occurring with coin 1 during a sequence of trials of $G_2$, a set of base sequences for the series is a set of sequences of observations in $S_n$, from which all other sequences of the series can be obtained by inserting suitably any number of subsidiary patterns of the form xo or ox in the sequences. Any sequence in this set of base sequences will be called a base sequence. It can be shown easily that a base sequence of $S_n$ $(n \geq 1)$ can contain either $(2n+2)$ or $(2n+4)$ or ... $(4n)$ trials. For $n \geq 1$, let us define a base sequence of $S_n$ of length $r$ as a sequence of $(2n+2r)$ trials $(1 \leq r \leq n)$. We state a theorem (which can be proved by simple methods) which gives us all the base sequences for $S_n$ $(n = 1, 2, \ldots)$ and shows the relation of the game $G_2$ with the dominations of the partitions of an integer.

Theorem 2: To every domination of an $r$-partition of $n$ by another there corresponds a base sequence of $S_n$ of length $r$ and conversely.


Thus $(n, r) = \frac{1}{n}\binom{n}{r}\binom{n}{r-1}$ represents the number of base sequences of $S_n$ of length $r$ and the total number of base sequences of $S_n$ is

$$\sum_{r=1}^{n} (n, r) = \frac{1}{n+1}\binom{2n}{n}.$$


Making the convention $(0, 1) = 1$, i.e., the number zero dominates itself, the theorem is true for all integral $n \geq 0$.
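The counts in Theorem 2 can be checked numerically. The sketch below assumes that $(n, r)$ is the Narayana number $\frac{1}{n}\binom{n}{r}\binom{n}{r-1}$ and that the total over $r$ is the Catalan number $\frac{1}{n+1}\binom{2n}{n}$; the function names are illustrative and not part of the paper.

```python
from math import comb


def dominations(n: int, r: int) -> int:
    """Assumed count (n, r) = C(n, r) * C(n, r - 1) / n of base sequences of S_n of length r."""
    if n == 0:
        # Convention stated in the text: (0, 1) = 1.
        return 1 if r == 1 else 0
    return comb(n, r) * comb(n, r - 1) // n


def total_base_sequences(n: int) -> int:
    """Sum of (n, r) over all admissible lengths r."""
    if n == 0:
        return dominations(0, 1)
    return sum(dominations(n, r) for r in range(1, n + 1))


def catalan(n: int) -> int:
    """Catalan number C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


if __name__ == "__main__":
    for n in range(6):
        print(n, total_base_sequences(n), catalan(n))
```

For $n = 2$ this gives one base sequence of each length $r = 1, 2$, in agreement with the two base sequences stated for $S_2$.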
Finally, since the probability that the game $G_2$ ends in a finite number $n$ of trials approaches unity as $n \to \infty$, we have the identity


Я 2р8 dip P2 а) 
a s a nat браз CHADH pie ЖОГУ, 
192 2 2 


‘ ў А i = 1, 2), 
where ^+ > i p, = 1—4 = 1,2) 


the general term of this series being : 


oibus 2 [( т ) ( i )+(3) ( |) pide ] (A) pq ] " 
RANDOM PROCESSES IN ECONOMIC THEORY AND ANALYSIS

By P. A. P. MORAN
Australian National University, Canberra


SUMMARY. The various models of discrete parameter random processes used in econometrics and economic theory are reviewed, and a summary given of the known results in the theory of testing hypotheses and estimating parameters for such models. It is shown that for empirical economic series it may be difficult or impossible to make an adequate verification of the hypotheses on which such methods are based. The example of testing the correlation between two short series is considered in detail. Finally an outline is given of the identification and estimation problem for multivariate processes.


1. THE USE OF MODELS IN ECONOMICS


Economic theory can only be either useful or substantiated when it has been related to empirical facts which are necessarily almost entirely of a numerical kind. There are so many of these facts that it is necessary to be able to summarize them or otherwise describe them in short form and thus some kind of descriptive statistical method is necessary in economics. Even when this is done, however, they rarely show clear patterns which can be easily interpreted in terms of economic theory. It is this fact which makes economics such a different science from physics. Moreover, unlike physics again, there are usually so many other influences, often unmeasurable, at work besides those considered in the theory that even when some kind of pattern appears to show itself, there can be no great assurance that this is not due to random causes not considered in the theory.

For these reasons it is necessary to go further and use theoretical statistics in an attempt to sort out which influences on the data are of a systematic kind and which can be regarded as, in some sense or other, random influences. Only in this way can one guard against the easy errors involved in arguing from observed patterns which are in fact only due to chance.
Theoretical statistics is necessarily based on probability theory and our method of procedure must therefore be to construct a theoretical probability model which, we hope, is a close representation of actual processes resulting in the observations used. This setting up of a model in which some or all of the variables considered are random variables, i.e., have associated with them probability distributions, is an essential step in comparing an economic theory with empirical data. Models involving probability may also throw some light on economic theory: for example some dynamic economic models may be shown to be intrinsically unstable when small random influences are incorporated into the model. Again one plausible theory to explain why some economic quantities show a quasi-cyclic behaviour is that these quantities are determined by a set of equations determining a stable equilibrium, and that this equilibrium is disturbed by random shocks. For certain values of the constants the system, although stable, may show a tendency to overshoot its equilibrium values and thus show a quasi-cyclic behaviour.


In setting up a model it is first of all necessary to be quite clear as to which quantities are to be supposed to have probability distributions. It is therefore convenient to distinguish between quantities which are supposed to remain fixed in the model and those which have a probability distribution. We shall call both kinds of quantities variables. Variables with a probability distribution we shall also call variates. The joint distribution of the variates in the model may be called the probability set of reference. This is the conceptual set of possibilities which is used in order to draw inferences from the statistical data. The value and necessity of making quite clear what this set is in each particular case will be illustrated in the examples which follow.


The grounds for adopting a particular model are partly background knowledge of the situation and intuition into its structure, and partly based on examination of the actual data. Thus, for example, in a physical measurement of a length, past experience together with theoretical knowledge both suggest that the distribution of errors of measurement may be supposed to be the normal distribution. Further evidence may also be obtained by applying tests of normality to the observations, but, if these are few in number, this evidence may be slight.


As will be shown by later examples, a false model may lead to totally incorrect 


conclusions. On the other hand, such a model may, in some circumstances, be quite useful, 


especially in prediction. For example in studying annual sunspot numbers a prediction 
based on a regression equation connecting each annual value with the two previous values
will result in a fair predictor; but this is certainly not an adequate description of whatever 
process really underlies the system. 


Having set up a probability model of a plausible kind we may proceed to test various hypotheses and to estimate the various parameters in the model. The setting up of a test of any hypothesis requires some care. The test must be relevant to the kind of divergence from the hypothesis which is in view and this requirement has led to the introduction of the idea of the power of a test. Moreover the test criterion may tend to show significant results, not because the hypothesis tested is false, but because the model is not correct. [...] different from some [...]

Further difficulties arise when we turn to the problem of estimating the parameters in the model. To simplify the situation as much as possible suppose the model involves parameters $\theta_1, \ldots, \theta_p$ and the observations are a sample $x_1, \ldots, x_n$ of a single variate which has a probability distribution $f(x \mid \theta_1, \ldots, \theta_p)$ specified by the model. From the sample we construct functions $a_1(x_1, \ldots, x_n), \ldots, a_p(x_1, \ldots, x_n)$ which we use as our estimators of $\theta_1, \ldots, \theta_p$. We do not consider here the various well-known criteria which it would be desirable for such estimators to possess but ask the more fundamental question whether the $\theta_i$ are in principle estimable at all. This is the problem of identifiability and may be illustrated by a simple example. Suppose that we set up a model to explain a set of observations $\{x_i\}$ in the following way. We suppose, from our knowledge of the situation, that $x = y + z$ where $y$ and $z$ are independent normal variates with means $m_1$, $m_2$ and standard deviations $\sigma_1$ and $\sigma_2$. Then we soon see that not even the introduction of Banach spaces will enable us to estimate these four quantities separately and they are, in principle, not estimable individually. They are said to be unidentifiable. We can, however, estimate $m_1 + m_2$ and $\sigma_1^2 + \sigma_2^2$ and these quantities are therefore said to be identifiable. A practically more relevant example of non-identifiability is that of the constants in a pair of demand-supply relationships and a clear discussion of this is given by Koopmans (1949).
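The example of $x = y + z$ can be made concrete with a small sketch. It assumes only the standard result that the sum of two independent normal variates is normal with mean $m_1 + m_2$ and variance $\sigma_1^2 + \sigma_2^2$; the parameter values are hypothetical and chosen purely for illustration.

```python
import math


def distribution_of_sum(m1, sigma1, m2, sigma2):
    """Mean and standard deviation of x = y + z for independent normal y and z."""
    return (m1 + m2, math.sqrt(sigma1 ** 2 + sigma2 ** 2))


# Two different (hypothetical) structures: (m1, sigma1, m2, sigma2).
structure_a = (1.0, 3.0, 2.0, 4.0)
structure_b = (0.0, 4.0, 3.0, 3.0)

observable_a = distribution_of_sum(*structure_a)
observable_b = distribution_of_sum(*structure_b)

if __name__ == "__main__":
    # Both structures imply the same normal distribution for x, so no sample
    # of x alone can distinguish between them.
    print(observable_a, observable_b)
```

Since the distribution of $x$ depends on the four parameters only through the two identifiable combinations, any estimator built from the sample gives the same answer under both structures.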

The problem of identification is not only of importance theoretically but also practi- 
cally, for a policy decision in an economic situation may have an effect which depends on a 


parameter unidentifiable from the observations and the effect of the policy decision may 


therefore be unpredictable. Illustrations of this and other examples of non-identifiability 


will occur later. 
It is rare to have a situation in which there is only one kind of measurement or variable. We usually have to deal with situations in which each element of the sample consists of two or more measurements. Before considering how random processes enter into our models, it is of value to consider the present situation of the theories of correlation and regression, a subject in which there has been much confusion. Let us suppose we are confronted with an empirical situation in which we have $n$ pairs of values, $(x_1, y_1), \ldots, (x_n, y_n)$, and we are going to set up a probability model in order to make inferences from the data. As we are going to ignore, for the moment, all the problems which result from the use of random processes we shall always assume that whatever model we use, different pairs $(x_i, y_i)$, $(x_j, y_j)$ will be distributed independently. To investigate the relationship between $x$ and $y$ we can set up at least four different types of model.

Consider first the model of linear regression (non-linear regression models are of the same type and may be considered in the same way). Here we suppose that the $x_i$ are simply mathematical quantities or variables with no probability distribution attached to them. We shall suppose the $y_i$, on the other hand, have a probability distribution whose mean is a linear function of $x$. It is usually satisfactory to assume further that this distribution is normal or Gaussian and that its standard deviation is a constant (nearly always unknown) independent of $x$. Notice that in this model all the probability is related to the $y$ variable and so we say that in this model $y$ is a variate and $x$ only a mathematical variable. The probability set to which our inferences are related then consists of the joint distribution of all the $y$'s for the fixed observed set of $x$'s. Using this distribution we can set up various tests of significance and estimators. For example, if the mean value of $y$ is taken to be a linear function, $\alpha + \beta x$, of $x$ we can test the hypothesis that $\beta = 0$ by the test criterion

$$t = \frac{b\, s_x \sqrt{n-2}}{\sqrt{s_y^2 - b^2 s_x^2}}$$

which is distributed in Student's t-distribution with $n-2$ degrees of freedom. Here

$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad s_x^2 = \frac{1}{n}\sum (x_i - \bar{x})^2, \qquad s_y^2 = \frac{1}{n}\sum (y_i - \bar{y})^2.$$

The parameters in this model are $\alpha$, $\beta$ and the variance of $y$ for fixed $x$. All these parameters are identifiable.
A model of a completely different kind is obtained when both variables are assumed to be random variables, i.e., we assume that the pairs $(x_i, y_i)$ are jointly distributed in a bivariate probability distribution and we may ask first whether there is any evidence to show that $x_i$ and $y_i$ are not distributed independently. The natural tool to use in this case is the correlation coefficient

$$r = \frac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum (y_i - \bar{y})^2 \sum (x_i - \bar{x})^2}}.$$

Under the hypothesis that $x$ and $y$ really are distributed independently, it is easy to show that the mean and variance of the distribution of $r$ are zero and $(n-1)^{-1}$ respectively. The exact form of the distribution depends on the distribution of $x$ and $y$ but tends to normality with increasing $n$. It is a remarkable fact that if we now assume merely that one of the two variates $x$ and $y$ is normally distributed, we can deduce the exact distribution which is given in the standard text books and is well tabulated, and which applies, a fortiori, when both $x$ and $y$ are normally distributed. It is known, moreover, that the null distribution of $r$ is relatively insensitive to joint variation of the distributions of $x$ and $y$ from normality and in this respect the test based on $r$ is said to be robust.


We also notice that, purely algebraically,

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

and that the distribution of $r$, when $x$ and $y$ are independent and one a normal variate, can be obtained from that of $t$. In the two cases we calculate the same quantity $t$, and under the assumption of the null hypothesis, ascribe to it the same distribution. The two cases are, however, fundamentally different because they refer to different probability sets or universes of reference. In the regression case we considered the universe of all possible values of $y$ given the fixed observed values of $x$, whereas in the second case we considered a universe of reference in which the $x$'s also vary. The difference between the two tests becomes a matter of practical importance when we consider their power relative to specified alternatives, e.g. that $\beta \neq 0$.

Finally we notice that in the correlation model, the parameters of the population, the means, variances and correlation coefficient of $x$ and $y$, are all identifiable and can be efficiently estimated from a sample.
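The algebraic identity between the regression criterion and the correlation criterion can be verified on any data set. The following sketch uses a small hypothetical sample; the sums of squares are left undivided by $n$, which does not affect either ratio.

```python
import math


def _moments(x, y):
    """Corrected sums of squares and products about the sample means."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return n, sxx, syy, sxy


def regression_t(x, y):
    """t = b * s_x * sqrt(n - 2) / sqrt(s_y^2 - b^2 * s_x^2), with b the regression slope."""
    n, sxx, syy, sxy = _moments(x, y)
    b = sxy / sxx
    return b * math.sqrt(sxx) * math.sqrt(n - 2) / math.sqrt(syy - b * b * sxx)


def correlation_t(x, y):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with r the correlation coefficient."""
    n, sxx, syy, sxy = _moments(x, y)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Hypothetical observations, for illustration only.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_values = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9]

if __name__ == "__main__":
    print(regression_t(x_values, y_values), correlation_t(x_values, y_values))
```

The two functions return the same number up to rounding error, although, as the text stresses, the probability sets of reference behind them differ.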


A model of more relevance in economic applications is that known as the 'Error in Variables Model'. Here we suppose the observations $(x_i, y_i)$ to be random variables obtained as measurements or estimates of quantities $u_i$, $v_i$ which are linearly related. To be specific we suppose that $v_i = \alpha + \beta u_i$ $(i = 1, \ldots, n)$ where $\alpha$ and $\beta$ are quantities describing the structure of the model and the $u_i$, $v_i$ may be either fixed quantities or variates but are, in any case, unobserved. Next we suppose that

$$x_i = u_i + \varepsilon_i,$$
$$y_i = v_i + \eta_i,$$

where the $\varepsilon_i$, $\eta_i$ are sets of random variables which are independent of each other for different $i$ and which may, in most cases, be assumed to be independent. It will then be found that $\alpha$ and $\beta$ are not identifiable.
To be more specific suppose that $u_i$ is a normal variate with unknown mean $m$ and unknown standard deviation $\sigma_1$. $v_i$ is then a normal variate with mean $\alpha + \beta m$ and standard deviation $\beta\sigma_1$. Suppose that $\varepsilon_i$, $\eta_i$ are normal variates with zero means and standard deviations $\sigma_2$ and $\sigma_3$. Then it is obvious that $x_i$ and $y_i$ are normal variates with means $m$, $\alpha + \beta m$, standard deviations $\sqrt{\sigma_1^2 + \sigma_2^2}$, $\sqrt{\beta^2\sigma_1^2 + \sigma_3^2}$ and correlation

$$\frac{\beta\sigma_1^2}{\sqrt{\sigma_1^2 + \sigma_2^2}\,\sqrt{\beta^2\sigma_1^2 + \sigma_3^2}}.$$

It is easy to see that these five quantities are identifiable from the sample but even if they are known exactly, we could not deduce from them the values of $\alpha$ and $\beta$ which are thus unidentifiable.
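A short numerical sketch may make the non-identifiability concrete. It evaluates the five observable quantities from the formulas just given for two hypothetical structures with different $\alpha$ and $\beta$; the structures are parametrized by variances rather than standard deviations purely for convenience, and all numerical values are assumptions chosen for illustration.

```python
import math


def observable_quantities(m, alpha, beta, var_u, var_eps, var_eta):
    """Means, standard deviations and correlation of (x, y) in the error-in-variables model."""
    mean_x = m
    mean_y = alpha + beta * m
    sd_x = math.sqrt(var_u + var_eps)
    sd_y = math.sqrt(beta ** 2 * var_u + var_eta)
    corr = beta * var_u / (sd_x * sd_y)
    return (mean_x, mean_y, sd_x, sd_y, corr)


# Two hypothetical structures with different alpha and beta.
structure_a = dict(m=2.0, alpha=1.0, beta=1.0, var_u=1.0, var_eps=1.0, var_eta=1.0)
structure_b = dict(m=2.0, alpha=0.0, beta=1.5, var_u=2.0 / 3.0, var_eps=4.0 / 3.0, var_eta=0.5)

observed_a = observable_quantities(**structure_a)
observed_b = observable_quantities(**structure_b)

if __name__ == "__main__":
    # The five observable quantities agree, so the data cannot tell
    # beta = 1.0 apart from beta = 1.5.
    print(observed_a)
    print(observed_b)
```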


The importance of this in economics arises from the fact that it may be important to know the values of $\alpha$ and $\beta$ rather than the observed regressions. For, if by some policy decision we could increase $u_i$, say, by some specified amount, we would like to know what would be the effect on $E(y_i)$, the expected value of $y_i$, and this is impossible if we do not know $\beta$.

Although we cannot estimate $\beta$, it is possible to place bounds on the underlying relationship between $v$ and $u$. This linear relationship must pass through the true means of $x$ and $y$ (a point which can be estimated) and have a slope intermediate between the slopes of the lines giving the regression of $y$ on $x$ and $x$ on $y$. More than this we cannot say without some further knowledge, such as, for example, the ratio of the variances $\sigma_2^2$ and $\sigma_3^2$. It may in fact be useful to know that the true relationship lies between certain limits and this, in a more complicated context, is the idea underlying Frisch's bunch map analysis (see for example Stone¹ (1945, 1954)). Furthermore it is also possible to take account of the uncertainty of the sample values of the regression coefficients and set up a test for a specified value of $\beta$ (Moran, 1956). This test is only useful in some circumstances in enabling us to reject some particular values of $\beta$ and for large samples the region in which rejection does not take place does not vanish. Another method of getting around the present difficulties is the use of 'Instrumental Variables'. For this see Stone (1954) and Reiersol (1945).
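The bounds on the slope can be written down directly from the second moments of $x$ and $y$. The sketch below assumes population moments are known exactly; the function name and the numerical values are hypothetical. The lower bound is the slope of the regression of $y$ on $x$ and the upper bound is the slope of the regression line of $x$ on $y$ when drawn in the $(x, y)$ plane.

```python
def slope_bounds(var_x, var_y, cov_xy):
    """Interval that must contain the structural slope beta (requires cov_xy != 0)."""
    slope_y_on_x = cov_xy / var_x   # regression of y on x
    slope_x_on_y = var_y / cov_xy   # regression of x on y, expressed as dy/dx
    return tuple(sorted((slope_y_on_x, slope_x_on_y)))


# Hypothetical moments: var(x) = 2, var(y) = 2, cov(x, y) = 1.
bounds = slope_bounds(2.0, 2.0, 1.0)

if __name__ == "__main__":
    print(bounds)
```

With these moments any slope between 0.5 and 2.0 is compatible with the observations, which illustrates why further knowledge, such as the ratio of the error variances, is needed to fix $\beta$.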
There is, finally, another model which is interesting in itself but not likely to be useful in economics. This is the Berkson model (Berkson, 1950; Lindley, 1953). Here we suppose that $x_1, \ldots, x_n$ are a set of previously specified fixed values such as might be chosen as a set of values at which some parameter in a physical experiment is to be fixed. However the actual experimental values cannot be prescribed exactly and are in fact $z_1, \ldots, z_n$ where the $z_i - x_i$ are random variables, the errors of specification, which are independent of each other and of the $x$'s. We then suppose that

$$y_i = \alpha + \beta z_i + \varepsilon_i$$

where the $\varepsilon_i$ are further random variables independent of $x_i$ and $z_i$. In this case it is easy to see that $\alpha$ and $\beta$ are identifiable. This is the situation which is likely to occur in physical experimentation but does not seem relevant in economics.


¹ Notice, however, that Stone calls the underlying relationship a "regression," and also, for example, takes the regression of x on y, inverts it and calls the resulting relationship a "regression." This is not the usual terminology.
In all the above models we have assumed that different pairs $(x_i, y_i)$ are independent of each other and have the same distribution. Largely as a result of the writings of Yule (1926) it became realized that there are many cases in which this does not apply and that very misleading conclusions may result. Yule calculated the correlation between the standardized mortality per 1000 persons, and the proportion of Church of England marriages to total marriages for the years 1866-1911 and found $r = 0.9512$. This appears to be highly significant by the ordinary test. He rightly describes this as a 'nonsense correlation' and its origin is clearly due to the fact that both series of figures show a strong trend. In this case one can regard the nonsense correlation as being due to the fact that both variates are strongly dependent on a third variable, time, and the fallacy of ascribing a causal connection is thus an elementary and well-known one. A less obvious fallacy arises when neither series shows a trend but both are serially dependent. If we have two series, such as annual sunspot numbers and some economic variable which has no trend (or has had the trend removed), we may legitimately regard the sample correlation between the two series as an estimate of the true correlation but we cannot test whether it is significantly different from zero by the ordinary test, since the latter is based on the explicit assumption that successive pairs are statistically independent.
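The trend fallacy is easy to reproduce with synthetic data. The sketch below does not use Yule's figures: it generates two independent artificial series of 46 annual values, each a linear trend plus independent normal noise (slopes, noise level and seed are arbitrary assumptions), and compares the correlation before and after removing a fitted linear trend.

```python
import math
import random


def pearson(x, y):
    """Ordinary product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def detrend(series):
    """Residuals from a least-squares straight line fitted against time."""
    n = len(series)
    times = list(range(n))
    mt, ms = sum(times) / n, sum(series) / n
    slope = (sum((t - mt) * (s - ms) for t, s in zip(times, series))
             / sum((t - mt) ** 2 for t in times))
    return [s - ms - slope * (t - mt) for t, s in zip(times, series)]


random.seed(1926)  # arbitrary seed, for reproducibility only
n_years = 46       # same length as the period 1866-1911
series_a = [0.5 * t + random.gauss(0.0, 2.0) for t in range(n_years)]
series_b = [0.5 * t + random.gauss(0.0, 2.0) for t in range(n_years)]

r_raw = pearson(series_a, series_b)
r_detrended = pearson(detrend(series_a), detrend(series_b))

if __name__ == "__main__":
    print("correlation of raw series:      ", round(r_raw, 3))
    print("correlation of detrended series:", round(r_detrended, 3))
```

The raw correlation is large although the two noise sequences are independent, while the detrended correlation is of the order expected by chance. The subtler difficulty raised in the text, serial dependence without trend, is not removed by this device.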


The ordinary statistical models as described above are thus valueless for much serially 
dependent data and hence we have to construct a new theory in which successive observations 
are serially dependent on each other. We are therefore led to the theory of random processes. 
The purpose of this paper is to survey the present state of this theory in so far as it is relevant 
to the econometrician. Much work not of direct interest in economics is not discussed 
(e.g. the investigations on spectral analysis by Grenander, Rosenblatt and others) and in 
addition the choice of subjects is somewhat biassed towards those with which I have come 
in contact, or been instructed in by my colleagues. 


2. RANDOM PROCESSES WITH ONE VARIABLE 


We must first decide whether the random processes we wish to study will involve time in a continuous or a discrete manner. For many applications in physics it is more convenient to take time as continuous and then we have to construct a theory of random functions $x(t)$. This can be done, but to obtain a strictly rigorous theory requires a considerable mathematical apparatus. Moreover the data we deal with in economics are always given at discrete intervals. It is true that such data are often not the value of a variate at an instant of time but a sum or integral over an interval of what might be perhaps better regarded as something going on continuously. Nevertheless the simplification resulting from considering discrete moments of time, separated by intervals of constant length, is great and we therefore confine ourselves to this case.


We next have to consider what unit to take for the time interval which might be years, weeks or even days. In most cases a year is the interval considered. This usually results in quite short series, but if we attempt to increase the length of the series by taking months or weeks, we may get into trouble. Not only may we now have seasonal effects to eliminate before further analysis but we find that we are now studying smaller scale phenomena, in the time sense, and if there is a lag or serial dependence between two series, this will be spread over many more intervals than before, usually resulting in much heavier computations. Thus the gain in apparent accuracy by increasing the number of terms may be illusory. On the other hand, economic effects are probably fairly short term and perhaps of the order of one to three months. One month may therefore be the best unit, if seasonal effects can be removed. In most cases, therefore, we will want to consider a series of variates $x_t$ ($t = \ldots, -2, -1, 0, 1, 2, \ldots$, say) where the unit in which $t$ is measured is one year or one month.
If the joint distribution of any set of these $(x_t, x_u, \ldots, x_v)$ depends only on the differences between $t, u, \ldots, v$ and not on their absolute value we say such a process is stationary. In practice the two most usual reasons why this assumption does not apply are the existence of seasonal effects if the time interval is less than one year, and the existence of trend caused by technological or other economic development. The first may be removed by using a correction such as might be obtained from a thirteen month weighted average, or from the totals over the years for the individual months. Neither of these methods can be regarded as very satisfactory but it is probably not possible to find much better methods. A trend may be removed by taking out a linear or higher order regression on time as an auxiliary variable. This introduces the complications connected with regression which will be discussed later. Alternatively we may remove a trend by subtracting a moving average. Besides shortening the series studied this tends to distort the nature of the serial dependence. For the present, therefore, we shall consider processes which are stationary.

For many purposes it is also convenient to assume that the joint distribution of any set of the $x$'s is a multivariate normal distribution, but even if this is not assumed we shall always assume that the second moment (and therefore also the first) exists. We may take the first moment or mean to be zero and the second to be $\sigma^2$, so that we write $E(x_t) = 0$, $E(x_t^2) = \sigma^2$. We then define the serial correlation coefficients, $\rho_s$, by

$$\sigma^2\rho_s = E(x_t x_{t+s}) = \sigma^2\rho_{-s}$$

and it is convenient to introduce a serial covariance generating function

$$S_x(z) = \sum_{s=-\infty}^{\infty} \rho_s z^s$$

where $z$ is a complex variable (Quenouille (1947), Moran (1949)). This series usually converges in a ring $1-\delta < |z| < 1+\delta$ where $\delta > 0$, but even if it does not, it can be taken as defining a function on the unit circle $|z| = 1$. The advantage of using such a generating function is that if we define a new process $\{y_t\}$ by an equation of the form

$$y_t = \sum_{i=0}^{\infty} a_i x_{t-i}$$

where $\sum_{0}^{\infty} a_i$ is (say) dominated by a convergent geometric series, then the serial covariance generating function of the $\{y_t\}$ process is given by

$$\sigma_y^2 S_y(z) = \sigma_x^2 \left(\sum_{i=0}^{\infty} a_i z^i\right)\left(\sum_{i=0}^{\infty} a_i z^{-i}\right) S_x(z),$$

which simplifies much of the elementary algebra connected with simple processes.
The fundamental fact about serial correlation coefficients is given by Wold's theorem (1938) which is the analogue for discrete processes of the theorem for continuous processes proved by Khintchine (1934). This states that the necessary and sufficient condition that any arbitrary sequence of real constants $\{\rho_k, k = 0, \pm 1, \ldots\}$ are the serial correlations of some discrete stationary process is that there exists a non-decreasing function $W(\theta)$ such that

$$W(0) = 0, \qquad W(\pi) = \pi,$$

and

$$\rho_k = \frac{1}{\pi}\int_0^{\pi} \cos k\theta \, dW(\theta).$$

$\sigma^2 W(\theta)$ is then known as the integrated power spectrum of the process and $\sigma^2 W'(\theta)$, if it exists, is known as the spectral density. If $W'(\theta)$ exists everywhere in the interval $(0, \pi)$ and satisfies certain very wide regularity conditions, it is given by

$$W'(\theta) = 1 + 2\sum_{k=1}^{\infty} \rho_k \cos k\theta = S(e^{i\theta}).$$


The idea behind the introduction of the power spectrum comes from the study of processes with continuous time used to represent noise in electrical theory. Here the use of electrical filters leads to the idea that an arbitrary current (possessing some kind of stationarity) can be represented as a sum of periodic components with random amplitudes and phases. The total energy of these currents is the sum of the energies associated with each and the power spectrum is a measure of the proportion of the energy corresponding to each frequency range. When we have a discrete sequence instead of a continuous function, there is an upper bound to the frequencies required and $W(\theta)$ is only defined for an interval $(0, \pi)$ whereas in Khintchine's original theorem $W(\theta)$ has to be defined over the whole interval $(0, \infty)$.


Consider now various simple models. The next simplest model after a completely random series is obtained by defining a process $\{x_t\}$ in terms of a completely random stationary process $\{\varepsilon_t\}$ by the relationship

$$x_t = \rho x_{t-1} + \varepsilon_t.$$

This has come to be known as a simple Markov process. It is clear that for $\{x_t\}$ to be a stationary process we must have $|\rho| < 1$, and by multiplying by $x_{t-s}$ $(s > 0)$ and taking expectations we see that $\rho_s = \rho_{-s} = \rho^s$. This process possesses what has come to be known as the Markovian character, namely, that if $x_t$ is known, the conditional distribution of any set of $x$'s with suffixes greater than $t$, given $x_t$, is independent of all the $x$'s with suffixes less than $t$.
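For the simple Markov process the relation between serial correlations and spectral density in Wold's theorem can be checked numerically. The sketch below assumes $\rho_s = \rho^{|s|}$ and uses the standard closed form $(1-\rho^2)/(1 - 2\rho\cos\theta + \rho^2)$ for the sum $1 + 2\sum \rho^k \cos k\theta$, which is not stated in the text; the value $\rho = 0.6$ and the integration step are arbitrary choices.

```python
import math


def spectral_density_series(rho, theta, terms=200):
    """W'(theta) = 1 + 2 * sum_{k>=1} rho^k cos(k theta), truncated after `terms` terms."""
    return 1.0 + 2.0 * sum(rho ** k * math.cos(k * theta) for k in range(1, terms + 1))


def spectral_density_closed(rho, theta):
    """Closed form of the same sum, valid for |rho| < 1."""
    return (1.0 - rho ** 2) / (1.0 - 2.0 * rho * math.cos(theta) + rho ** 2)


def correlation_from_spectrum(rho, k, steps=2000):
    """(1 / pi) * integral over (0, pi) of cos(k theta) W'(theta), by the trapezoidal rule."""
    h = math.pi / steps
    total = 0.0
    for j in range(steps + 1):
        theta = j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * math.cos(k * theta) * spectral_density_closed(rho, theta)
    return total * h / math.pi


if __name__ == "__main__":
    rho = 0.6
    for k in range(5):
        print(k, rho ** k, correlation_from_spectrum(rho, k))
```

The integral reproduces $\rho^k$ for each lag, so the function $W'(\theta)$ and the sequence of serial correlations carry the same information about the process.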


A more general model is obtained by defining $\{x_t\}$ in terms of $\{\varepsilon_t\}$ by the relationship

$$x_t + a_1 x_{t-1} + \ldots + a_k x_{t-k} = \varepsilon_t.$$

In order that this generates a stationary process, it is not hard to see that we must impose the condition that all the roots of the equation

$$z^k + a_1 z^{k-1} + \ldots + a_k = 0 \qquad (2.1)$$

lie inside the unit circle $|z| = 1$. The calculation of the serial correlations is a little more complicated but is facilitated by the use of the serial covariance generating function $S_x(z)$ of $\{x_t\}$. Since $\{\varepsilon_t\}$ can be regarded as generated by a moving average of $\{x_t\}$ and has itself a serial covariance generating function equal to unity, we have (taking $\varepsilon_t$ to have unit variance)

$$(1 + a_1 z + \ldots + a_k z^k)(1 + a_1 z^{-1} + \ldots + a_k z^{-k})\,\sigma^2 S_x(z) = 1$$

and so

$$\sigma^2 S_x(z) = (1 + a_1 z + \ldots + a_k z^k)^{-1}(1 + a_1 z^{-1} + \ldots + a_k z^{-k})^{-1}.$$

The advantage of using this type of model, which was introduced by Yule (1927) and is sometimes known as an autoregressive series, is that by choosing the $a_i$ so that the roots of (2.1) are complex we can obtain series which show oscillatory behaviour, i.e., a tendency to overswing the mean. In univariate series the models of this type which have been most used have $k = 2$.
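The case $k = 2$ can be explored with a few lines of code. The sketch below assumes the scheme $x_t + a_1 x_{t-1} + a_2 x_{t-2} = \varepsilon_t$ and obtains the serial correlations from the recursion $\rho_s + a_1\rho_{s-1} + a_2\rho_{s-2} = 0$ (found by multiplying by $x_{t-s}$ and taking expectations, as for the Markov process); the coefficients $a_1 = -1$, $a_2 = 0.5$ are hypothetical.

```python
import cmath


def characteristic_roots(a1, a2):
    """Roots of z^2 + a1 z + a2 = 0, i.e. equation (2.1) with k = 2."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return ((-a1 + disc) / 2.0, (-a1 - disc) / 2.0)


def serial_correlations(a1, a2, max_lag):
    """rho_0, ..., rho_max_lag from rho_s = -a1 * rho_{s-1} - a2 * rho_{s-2}."""
    rho = [1.0, -a1 / (1.0 + a2)]
    for s in range(2, max_lag + 1):
        rho.append(-a1 * rho[s - 1] - a2 * rho[s - 2])
    return rho[: max_lag + 1]


# Hypothetical coefficients giving complex roots inside the unit circle.
roots = characteristic_roots(-1.0, 0.5)
rho = serial_correlations(-1.0, 0.5, 8)

if __name__ == "__main__":
    print("roots:", roots, "moduli:", [abs(z) for z in roots])
    print("serial correlations:", [round(r, 4) for r in rho])
```

With complex roots the serial correlations change sign as the lag increases, which is the damped oscillatory behaviour described in the text.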

Another type of model which is easy to discuss is the finite moving average scheme.
Here we suppose $x_t = a_0\epsilon_t + a_1\epsilon_{t-1} + \dots + a_k\epsilon_{t-k}$, where $a_0 > 0$, $a_k > 0$, and
$\{\epsilon_t\}$ is a completely random series. Then

$$\rho_s = (a_0^2 + \dots + a_k^2)^{-1}(a_0 a_s + \dots + a_{k-s} a_k) \qquad (2.2)$$

and $\rho_s = 0$ for $|s| > k$. It is sometimes implied in the literature that taking a moving
average of a completely random series introduces 'oscillations'. If all the $a_i > 0$, this is not
so, since all the serial correlations are positive, if we use the word 'oscillatory' to mean that
if the system is disturbed from its mean position there will be a tendency to overswing the
mean. Such a genuine 'oscillatory' tendency can only occur if some of the weights $a_i$ are
negative. What does happen is that a moving average with positive weights will convert
an irregular looking series into one which shows an appearance of wandering about the mean
and in which neighbouring values tend to be alike. This can be easily mistaken, at first
sight, for an oscillatory behaviour and has sometimes led people mistakenly to suppose that
apparently oscillatory behaviour can be explained in this way.

Moving averages, especially with equal weights, have sometimes been used to remove
trend. Thus given a sequence $\{x_t\}$ we might estimate a trend value at $t$ by

$$X_t = (2k+1)^{-1}(x_{t-k} + \dots + x_{t+k})$$

and then regard $\{x_t - X_t\}$ as the stationary process to be studied. If $k$ is not too small, the
effect of this on the spectrum is clearly to remove the components with long periods and leave
the short-term components relatively unaffected, a linear regression with time being treated
as a component of indefinitely long period. Thus the spectrum and therefore the
correlational properties of the series are less affected if we use a moving average of greater extent.
Unfortunately economic series are not very long and the use of a long moving average
usually means throwing away too many terms at each end.
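The point about positive weights can be checked numerically. The following is a minimal Python sketch, not taken from the paper: the function name `ma_autocorrelations` and the two example weight sets are illustrative assumptions. It evaluates the serial correlations of a finite moving average from its weights and shows that they are all non-negative when every weight is positive, while a negative weight can produce a negative correlation.

```python
def ma_autocorrelations(weights):
    """Serial correlations rho_0..rho_k of x_t = sum_j a_j * eps_{t-j},
    where eps_t is a completely random series and weights = [a_0, ..., a_k]."""
    k = len(weights) - 1
    denom = sum(a * a for a in weights)
    return [
        sum(weights[j] * weights[j + s] for j in range(k - s + 1)) / denom
        for s in range(k + 1)
    ]


# Equal positive weights: every serial correlation is positive (no overswing).
equal = ma_autocorrelations([1.0, 1.0, 1.0])

# One negative weight: the lag-one correlation becomes negative.
mixed = ma_autocorrelations([1.0, -0.5, 0.25])

print(equal)
print(mixed)
```

Beyond lag k the correlations are zero, so the list returned for lags 0 to k describes the whole correlogram of the scheme.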
Finally, another model, which is historically the earliest, is that of 'concealed
periodicities.' Here we suppose that $x_t$ is the sum of a finite number of strictly periodic terms
(usually taken as trigonometric) together with a superimposed error, so that we write

$$x_t = \sum_i A_i \cos(a_i t + b_i) + \epsilon_t,$$

where $\epsilon_t$ may be a completely random series or may itself be a process with serial dependence.
This is the natural model for dealing with tidal, meteorological and similar phenomena, in which
it is clear that there are strictly periodic components. It was this idea which led to the
introduction and use of the periodogram.
Whilst this type of model is very successful in dealing with some phenomena, e.g., the tides,
it was its failure with the annual sunspot cycle which led Yule to introduce processes of an
oscillatory nature without strictly periodic components and so to introduce random processes
into statistics. The processes of concealed periodicities are unlikely to be of much interest
to the econometrician except in aiding him in understanding what happens when he removes
seasonal variations.


Slutzky (1927) established some interesting theorems on the effects of summing and 
differencing random series which are often and mistakenly quoted as illustrating a way in 
which oscillatory behaviour might arise in economic series. If we take a completely random
series $\{\epsilon_t\}$ and take a moving average of two by the formula $x_t = \epsilon_t + \epsilon_{t-1}$, and repeat
this process a further $n-1$ times and then take the $m$-th difference, we obtain a curve
remarkably like a sine wave. In fact if we let n and m increase, the series can be 
represented by a sine wave of period L given by 


$\cos 2\pi L^{-1} =$


to a degree of approximation which gets better and better as n and m get larger and larger. 
By the use of covariance generating functions this can be proved and generalized in a much 
easier manner than in Slutzky's paper (Moran, 1949 & 1950). This result has in fact no econo- 
metric implications, and if we simply take repeated moving averages with positive weights, 
we do not get an oscillatory process. The heuristic reason for Slutzky's result is as follows. 
The effect of taking a moving average with positive weights repeatedly is, in effect, to generate 
a moving average whose weights can be approximated by a multiple of the normal or Gaussian 
distribution. This is simply a result of the Central Limit Theorem. The effect of taking 
the m-th difference is to turn this moving average into one whose weights can be approxi- 
mated by the m-th derivative of the normal distribution—the m-th tetrachoric function. 
In the main part of its range this function mimics a sine wave (for a proof see Szegő (1939),
p. 194). Thus all that the process of adding and differencing has done is to produce a
moving average with weights closely graduated by a sine wave. The resulting process thus
also mimics a sine wave. It does not appear that this result, however interesting in itself,
has any relevance in econometrics.
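The heuristic argument can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than a reproduction of Slutzky's or Moran's calculations: the helper names are hypothetical, and the "implied period" simply reads the lag-one correlation of the weights as the cosine of 2π divided by the period, which is what a pure sine wave would give. The sketch builds the weights produced by n summations of two followed by m differences and reports their lag-one correlation.

```python
import math


def convolve(u, v):
    """Coefficients of the product of two polynomials given as lists."""
    out = [0] * (len(u) + len(v) - 1)
    for i, a in enumerate(u):
        for j, b in enumerate(v):
            out[i + j] += a * b
    return out


def slutzky_weights(n, m):
    """Moving average weights after n sums of two and m first differences."""
    w = [1]
    for _ in range(n):
        w = convolve(w, [1, 1])    # x_t = e_t + e_{t-1}
    for _ in range(m):
        w = convolve(w, [1, -1])   # first difference
    return w


def lag1_correlation(w):
    """Lag-one serial correlation of a moving average with weights w."""
    return sum(a * b for a, b in zip(w, w[1:])) / sum(a * a for a in w)


def implied_period(r1):
    """Period of a sine wave whose lag-one correlation is r1 (assumed reading)."""
    return 2.0 * math.pi / math.acos(r1)


w = slutzky_weights(3, 1)
r1 = lag1_correlation(w)
print(w, r1, implied_period(r1))
```

With m = 0 the weights are all positive, every serial correlation is positive, and no oscillation appears, which is the case the text describes as having no econometric implications.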


The main value of the theory of random processes in economics lies in the fact that
it enables us to construct models which may be fitted to observed series. However the theory
also throws some light on economic theory. Samuelson (1947) has shown how economic statics
requires a dynamic theory for the discussion of stability problems. Most linear dynamic
models, when their parameters are plausibly chosen, produce damped oscillations when they
produce oscillations at all. Unless, therefore, we suppose that the systems are essentially
non-linear, continuing undamped oscillations of bounded amplitude, such as we require for
a trade cycle theory, will have to be explained in some other way. The introduction of a
random or stochastic element into a linear model shows how we can set up such an
'endogenous model,' for if we have, say, a simple linear model of the form
$x_t + a x_{t-1} + b x_{t-2} = 0$ with
the constants chosen to produce damped oscillations, the process can be kept continually in
a state of oscillatory behaviour by disturbing this equation by putting a random element
on the right hand side. In this way we can avoid the assumption of non-linear models or
the existence of an exogenous oscillatory factor to account for the trade cycle. This has
been known, of course, for a long time. With a more complex model involving two or more
variables it is not even necessary to suppose the variables depend on any further variables
than those in the immediately preceding time interval, for (as will be seen later) a system of
the form

$$x_t = a x_{t-1} + b y_{t-1} + \epsilon_t$$
$$y_t = c x_{t-1} + d y_{t-1} + \eta_t$$

can also show oscillatory behaviour.
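A hedged sketch of when such a two-variable system oscillates: without the random terms, the pair of equations is a first-order linear recurrence, and it shows damped oscillations when the eigenvalues of its 2 x 2 coefficient matrix are complex with modulus below one. The Python below assumes the coefficients are labelled a, b, c, d as in the system just written; the function names are illustrative and do not come from the paper.

```python
import cmath


def eigenvalues_2x2(a, b, c, d):
    """Eigenvalues of the coefficient matrix [[a, b], [c, d]]."""
    trace = a + d
    det = a * d - b * c
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


def is_damped_oscillatory(a, b, c, d):
    """True when the undisturbed system oscillates and dies away."""
    lam1, lam2 = eigenvalues_2x2(a, b, c, d)
    return lam1.imag != 0.0 and abs(lam1) < 1.0 and abs(lam2) < 1.0


# Complex eigenvalues 0.5 +/- 0.4i: damped oscillation, kept alive by the random terms.
print(is_damped_oscillatory(0.5, -0.4, 0.4, 0.5))

# Real eigenvalues: the system decays without oscillating.
print(is_damped_oscillatory(0.5, 0.1, 0.1, 0.3))
```

Adding independent random terms to both equations then keeps the oscillation from dying out, which is the mechanism described in the text.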


We now turn to the problem of statistical inference from observed series. We shall
consider in turn (1) the estimation of the mean and variance of the process; (2) the testing
of serial dependence and the estimation of serial correlations; (3) the estimation of the
parameters of autoregressive models and the testing of their goodness of fit; (4) the
estimation of the spectrum in general.


Consider first how we might estimate the mean, $m$ say. Since $E(x_t) = m$, we could
take the mean of the sample

$$\bar{x} = n^{-1}\sum_{t=1}^{n} x_t.$$

This has a variance

$$\sigma^2 n^{-1}\Big\{1 + 2\sum_{s=1}^{n-1}\Big(1 - \frac{s}{n}\Big)\rho_s\Big\} \qquad (2.3)$$

which may often be taken in large samples to be

$$\sigma^2 n^{-1}\Big\{1 + 2\sum_{s=1}^{\infty}\rho_s\Big\}.$$

Here $\sigma^2$ is Var$(x_t)$. In most economic cases (2.3) will be larger than $\sigma^2 n^{-1}$, the value
if there is no serial correlation. $\bar{x}$ is an unbiassed estimator but usually not a most
efficient one except asymptotically. The most efficient estimator is, in fact, a weighted sum of
the $x_t$, the weights depending on the serial correlations which have themselves to be estimated.
For this reason in most circumstances it is the sample mean which is used. To estimate the
variance of the process we may use $(n-1)^{-1}\sum_{t=1}^{n}(x_t - \bar{x})^2$ which is consistent but
usually slightly biassed. For a further discussion of the estimation of means and variances see
Jowett (1955).
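To make the size of the effect in (2.3) concrete, here is a small Python sketch. It is an illustration only: the function name, the choice of a simple Markov correlation function, and the numerical values are assumptions made for the example.

```python
def variance_of_mean(n, sigma2, rho):
    """Variance of the mean of n consecutive terms of a stationary series.

    sigma2 is Var(x_t) and rho(s) returns the serial correlation at lag s.
    """
    total = 1.0 + 2.0 * sum((1.0 - s / n) * rho(s) for s in range(1, n))
    return sigma2 * total / n


def markov_rho(p):
    """Correlation function rho_s = p**s of a simple Markov process (assumed example)."""
    return lambda s: p ** s


independent = variance_of_mean(20, 1.0, lambda s: 0.0)
correlated = variance_of_mean(20, 1.0, markov_rho(0.5))

# Ratio of the two variances: how much positive serial correlation inflates Var(mean).
print(correlated / independent)
```

The ratio printed at the end measures how far the usual standard error of a mean understates the true uncertainty for a series of 20 terms under the assumed correlation function.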
We can test for serial dependence and the natural thing to do is to use the serial
correlation coefficient $r_s$, which may be defined in a variety of ways which differ only
slightly. We might, for example, define the serial correlation coefficient of order $s$ to be the
product moment correlation between the pairs of values $(x_1, x_{s+1}), \dots, (x_{n-s}, x_n)$; but this
gives an awkward denominator and it is preferable to use

$$r_s = \frac{n(n-s)^{-1}\sum_{t=1}^{n-s}(x_t - \bar{x})(x_{t+s} - \bar{x})}{\sum_{t=1}^{n}(x_t - \bar{x})^2} \qquad (2.4)$$

where $\bar{x} = n^{-1}\sum_{t=1}^{n} x_t$. This is the form used when the true mean of the process is unknown
as is almost invariably the case in econometrics. A simpler distributional theory is
obtained if we modify this to the 'circular' definition by writing

$$r_s = \frac{\sum_{t=1}^{n}(x_t - \bar{x})(x_{t+s} - \bar{x})}{\sum_{t=1}^{n}(x_t - \bar{x})^2} \qquad (2.5)$$

where we define $x_{n+t} = x_t$. A large number of papers have been written on the distributions
of the quantities defined by (2.4) and (2.5) both in the null case when the series $\{x_t\}$ is
completely random and under various alternative hypotheses. Good bibliographies will
be found in the papers by Watson (1956), Daniels (1956) and Jenkins (1954 and 1956), who
also consider partial serial correlations. The exact distributions obtained are invariably
awkward analytically, changing their form at a discrete set of points throughout their range,
and much effort has gone into providing good approximations. Moreover the effect of
correcting for the sample mean in the non-null case is not nearly so simple as for ordinary
correlation coefficients.


However a test is not difficult to apply in practice since tables exist for the distribution
of $r_1$ with a suitable definition. For example if, instead of taking (2.4) or (2.5), we define $r_1$
by

$$r_1 = \frac{\tfrac12(x_1 - \bar{x})^2 + \tfrac12(x_n - \bar{x})^2 + \sum_{j=2}^{n}(x_j - \bar{x})(x_{j-1} - \bar{x})}{\sum_{j=1}^{n}(x_j - \bar{x})^2} \qquad (2.6)$$

T. W. Anderson (1948) gives one sided 5%, 1% and 0.1% levels of significance for
$n = 4, 5, \dots, 60$. This definition of $r_1$ can be expressed in terms of von Neumann's mean
square successive difference

$$\frac{\delta^2}{s^2} = \frac{n(n-1)^{-1}\sum_{t=1}^{n-1}(x_{t+1} - x_t)^2}{\sum_{t=1}^{n}(x_t - \bar{x})^2} = 2n(n-1)^{-1}(1 - r_1).$$

Tables for the distribution of $\delta^2/s^2$ were given by B. I. Hart (1942), from which Anderson's
were calculated. Alternatively R. L. Anderson (1942) has given tables for the exact
distribution of $r_1$ as given by (2.5). Watson has pointed out in his thesis that the distribution of
$r_1$, when defined by (2.6), is almost exactly that of the ordinary correlation coefficient as based
on $n+3$ pairs of observations.

T. W. Anderson (1948) has considered in what sense these tests are optimal. If we
take the alternative hypothesis that the $x_t$ are generated by the simple Markov process
$x_t = \rho x_{t-1} + \epsilon_t$ where $|\rho| < 1$ and the $\{\epsilon_t\}$ are a completely random series, no uniformly
most powerful test exists even for one sided alternatives. However, the test will be close
to an optimum one, for all these differing definitions of $r_1$ give numerical values close to each
other, and (2.6) is known (T. W. Anderson (1948) p. 108) to provide a uniformly most powerful
one sided test against an alternative hypothesis about the distribution of the $x$'s which only
differs slightly from the above one. Moreover, $r_1$ as given by the above definitions is close
to the maximum likelihood estimator of $\rho$ for a simple Markov process generated by
$(x_t - m) = \rho(x_{t-1} - m) + \epsilon_t$ when $m$ is unknown. Thus tests based on the first order serial
correlation coefficient or von Neumann's ratio may be regarded in practice as optimal.
However, it should also be noticed that if the series consists of independent random variables with
a trend, $r_1$ will tend to appear significant. To distinguish trend from serial correlation we
therefore need to carry out a joint regression and serial correlation test which will be discussed
later.
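The link between the first order serial correlation coefficient with half-weighted end terms and von Neumann's ratio can be verified numerically. The Python sketch below assumes that definition of the coefficient (the squared end deviations enter the numerator with weight one half) and the usual definition of the ratio of the mean square successive difference to the variance; the function names and the data are hypothetical.

```python
def r1_with_end_terms(x):
    """First order serial correlation with half-weighted squared end deviations."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    num = sum(d[t] * d[t - 1] for t in range(1, n))
    num += 0.5 * d[0] ** 2 + 0.5 * d[-1] ** 2
    return num / sum(v * v for v in d)


def von_neumann_ratio(x):
    """Mean square successive difference divided by the variance (divisor n)."""
    n = len(x)
    mean = sum(x) / n
    delta2 = sum((x[t + 1] - x[t]) ** 2 for t in range(n - 1)) / (n - 1)
    s2 = sum((v - mean) ** 2 for v in x) / n
    return delta2 / s2


x = [1.0, 3.0, 2.0, 5.0, 4.0]           # made-up series for illustration
n = len(x)
r1 = r1_with_end_terms(x)
ratio = von_neumann_ratio(x)

# The two statistics carry the same information: ratio = 2n/(n-1) * (1 - r1).
print(r1, ratio, 2.0 * n / (n - 1) * (1.0 - r1))
```

Because one statistic is an exact linear function of the other, a table for either can be converted into a table for the other.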
It is of some interest to consider the distribution of $r_1$ in non-null cases. Most of the
work done on this has been for a 'circularly correlated' joint distribution of the $x$'s. Even
approximate forms of this distribution are complicated. However, in the case where
$x_t = \rho x_{t-1} + \epsilon_t$, so that the $x$'s have a known zero mean, Jenkins (1954) has given a simple
and accurate transformation which gives a very good approximation to the distribution
of $y = \sin^{-1} r_1$. In this case he shows that the moments are given by


9 P wi eae (n—29?)n7?4-0(n), 


м —3ü-p* م1‎ 


aL @—50°) p400) 
ит: 2—2) 2 
из = O(n), а = 320(7): 
This appears to be rather more satisfactory than the $\tanh^{-1}$ transformation introduced by
Quenouille (1948) when $|\rho| < 0.9$. When the true mean is not known
these results do not hold unless we make a correction by replacing $n$ by $n-1$. However,
if we make this correction, the error is likely to be $O(n^{-2})$ and little difference may be
expected in practice.

It should be noticed that $r_1$ is (on definition (2.4) at least) for small samples, quite
biassed as an estimator of $\rho$. This has been shown empirically by Orcutt (1948) and discussed
theoretically by Marriott and Pope (1954) and M. G. Kendall (1948). This is a matter of some
importance when using an estimated value of $\rho$ in a significance test for correlation between
series, as will be seen later.

A quite different approach to the problem of testing for serial correlation is due to
Ogawara (1951). Suppose that the null hypothesis is that the sequence $\{x_n\}$ is a completely
random sequence, normally distributed and the alternative hypothesis is that it is a Markov
process, i.e., that it is generated by a relationship of the form $(x_n - m) = \rho(x_{n-1} - m) + \epsilon_n$
where $\{\epsilon_n\}$ is a completely random series of normal variates with zero mean, and $m$ is unknown.
This process has the Markov property that the probability distribution of the whole of the
series $\{x_N\}$ for $N > n$, given the value of $x_n$, is independent of the values of $x_m$ for $m < n$. In
fact we can describe it as a 'nearest neighbour' process which means that the conditional
distribution of $x_n$, given $x_{n-1}$ and $x_{n+1}$, is independent of all the $x_i$ with $i < n-1$ or
$i > n+1$. Ogawara's idea is to calculate, given a series $x_1, \dots, x_n$ ($n$ odd), the regression of
the $x$'s with an even suffix on the means of the neighbouring $x$'s. This can then be tested
in a $t$-distribution in the ordinary way. An equivalent way of doing this test is to calculate
the ordinary correlation coefficient between the series $x_2, x_4, \dots$ and the corresponding
values $(x_1 + x_3), (x_3 + x_5), \dots$ This has the ordinary distribution of $r$ as based on $\tfrac12(n-1)$
pairs of observations. Hannan (1955) showed that this gives a test of the hypothesis $\rho = 0$
which is asymptotically efficient (for a definition of this expression see Pitman (1948),
Noether (1955) and Hannan (1956)), but it is not asymptotically efficient as a test of the
hypothesis $\rho = \rho_0 \neq 0$. Ogawara's test has the advantage of being exact and having
significance levels which are easily found from $t$-tables; but since the distribution of $r_1$ or of
von Neumann's ratio has also been tabulated, it does not offer any real advantage in the
present case. It can, however, be useful when extended to more complicated situations.
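The correlation form of Ogawara's test is easy to compute. The following Python sketch is an illustration under stated assumptions: the function name and the data are hypothetical, and the conversion of the correlation to a t value with (number of pairs minus 2) degrees of freedom is the ordinary textbook conversion, which is taken here to be what "the ordinary distribution of r" refers to.

```python
import math


def ogawara_statistic(x):
    """Correlate x_2, x_4, ... with (x_1 + x_3), (x_3 + x_5), ... for odd n.

    Returns the correlation r, the corresponding t value and its degrees of freedom.
    """
    if len(x) % 2 == 0 or len(x) < 7:
        raise ValueError("need an odd number of terms, at least 7")
    even = x[1::2]                                   # x_2, x_4, ... (1-based suffixes)
    sums = [x[i - 1] + x[i + 1] for i in range(1, len(x), 2)]
    k = len(even)                                    # (n - 1) / 2 pairs
    mean_even = sum(even) / k
    mean_sums = sum(sums) / k
    sxy = sum((a - mean_even) * (b - mean_sums) for a, b in zip(even, sums))
    sxx = sum((a - mean_even) ** 2 for a in even)
    syy = sum((b - mean_sums) ** 2 for b in sums)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((k - 2) / (1.0 - r * r))
    return r, t, k - 2


r, t, dof = ogawara_statistic([1.0, 2.0, 4.0, 3.0, 5.0, 7.0, 6.0])
print(r, t, dof)
```

Only about half of the observations enter as the dependent variable, which is the price paid for having an exact test.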


A more important problem is to be able to test for partial serial correlation. If
we have a simple Markov process generated by a relationship of the form $x_n = \rho x_{n-1} + \epsilon_n$
we have $\rho_s = \rho^s$ and if we define the partial serial correlation coefficient $\rho_{2.1}$ by the
relationship

$$\rho_{2.1} = \frac{\rho_2 - \rho_1^2}{1 - \rho_1^2} \qquad (2.7)$$

we find $\rho_{2.1} = 0$. This is not ordinarily the case for a higher order autoregressive series
and so the sample analogue of $\rho_{2.1}$,

$$r_{2.1} = \frac{r_2 - r_1^2}{1 - r_1^2},$$

can be used as a test of the hypothesis that the process is a simple Markov process against
the alternative hypothesis that further terms need to be included in the generating relation.
This is in fact an approximate likelihood ratio criterion. This method can be extended to
higher order schemes to test a null hypothesis of extent $k$ against one of extent $k+1$ as
originally suggested by Yule. Jenkins (1956) has given a smoothed (i.e. approximate) form for
the distribution of such a serial partial correlation coefficient of order $k$ with the effect of the
$k-1$ intermediate terms removed. The sample serial correlations used are circular and
corrected for the mean (if the true mean is known, the results are similar). It is then found that
the form of the distribution depends on whether $k$ is even or odd but the significance level
of any observed value can be calculated from tables of the incomplete Beta function. A
discussion of the extensive analytical theory required to establish these results will be
found in the papers of Watson (1956), Daniels (1956), and Jenkins (1956).

Thus to fit an autoregressive scheme we may calculate in turn successively higher
order partial serial correlation coefficients (not forgetting the 'Fallacy of Many Tests') and
then having decided on the order, estimate the coefficients of the scheme by least squares
using the observed correlations.
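As a minimal sketch of the first step of this procedure, the Python below computes circular serial correlations corrected for the sample mean and the second order partial serial correlation formed from them. The function names and the short data series are hypothetical, and no significance levels are attempted since these require the tables mentioned above.

```python
def circular_r(x, s):
    """Circular serial correlation of order s, corrected for the sample mean."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    return sum(d[t] * d[(t + s) % n] for t in range(n)) / sum(v * v for v in d)


def partial_serial_correlation(r1, r2):
    """Partial correlation of terms two apart with the intermediate term removed."""
    return (r2 - r1 * r1) / (1.0 - r1 * r1)


# Population check: for a simple Markov process rho_2 = rho_1 ** 2, so the partial vanishes.
rho = 0.6
print(partial_serial_correlation(rho, rho ** 2))

# Sample version on a short made-up series.
x = [1.0, 3.0, 2.0, 5.0, 4.0]
r1, r2 = circular_r(x, 1), circular_r(x, 2)
print(r1, r2, partial_serial_correlation(r1, r2))
```

A sample value far from zero would point towards a scheme of higher order than the simple Markov process.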

It is however possible to go further. Given an estimated autoregressive scheme all
the higher order serial correlations can be estimated and if plotted against their order
represent the estimated 'correlogram'. The fit of the observed correlogram to this can be
tested by goodness of fit criteria, the first of which is due to Quenouille (1947). These tests have
been further developed by Bartlett and Diananda (1950), and Walker (1950 and 1952) and
extended to two variate processes by Bartlett and Rajalakshman (1953). Such an overall
test of goodness of fit¹ may show up inadequacies in a model which do not appear when we
simply calculate a few higher order partial correlations. For example Bartlett (1954) has
shown that a series of 114 annual values of the logarithms of Canadian lynx trapped in the
Mackenzie River district in North West Canada which appear to be well fitted by a second
order autoregressive scheme (Moran, 1953) when we consider lower order partial serial
correlations, nevertheless show a strongly significant divergence from the estimated correlogram,
when a goodness of fit test is used. What this implies is not clear.

¹ For further developments in hypothesis testing in the analysis of empirical series see Whittle (75), ..., (77).

The failure of models based on concealed strictly periodic elements led to the
abandonment of the use of the periodogram in the analysis of processes. However, more recently
it has been realised that the periodogram can be a useful method. In fact just as the
population correlogram defined by $\{\rho_s,\ s = 0, \pm 1, \dots\}$ and the integrated spectrum defined by
$W(\theta)$ are in a kind of Fourier transform relationship to each other given by the
Wold-Khintchine theorem, so also are the corresponding sample quantities. If we have an
observed series $\{x_t\}$ with, say, a known zero mean, the periodogram ordinate, $I_p$, for a given
value of $p$ is defined by

$$I_p = a_p^2 + b_p^2$$

where

$$a_p = \sqrt{\frac{2}{n}}\sum_{t=1}^{n} x_t \cos\Big(\frac{2\pi p t}{n}\Big), \qquad b_p = \sqrt{\frac{2}{n}}\sum_{t=1}^{n} x_t \sin\Big(\frac{2\pi p t}{n}\Big).$$

If we take the serial covariance, $C_s$, to be equal to $(n-s)^{-1}\sum_{t=1}^{n-s} x_t x_{t+s}$ (with $C_{-s} = C_s$), we have

$$I_p = 2\sum_{s=-n+1}^{n-1}\Big(1 - \frac{|s|}{n}\Big) C_s \cos\Big(\frac{2\pi p s}{n}\Big)$$

and

$$E(I_p) = 2\sigma^2\sum_{s=-n+1}^{n-1}\Big(1 - \frac{|s|}{n}\Big)\rho_s \cos\Big(\frac{2\pi p s}{n}\Big)$$

which shows the relationship between the sample values. Instead of considering the serial
correlations we may attempt to estimate the spectrum directly by using $I_p$. In most cases
the true spectral density exists and is a fairly smooth function. However, the graph of $I_p$ is
usually very irregular, since $I_p$ and $I_q$ have approximately zero correlation for $p$, $q$ integral.
Moreover, as the length of the series increases, $I_p$ does not converge in probability to its
expected value. Various methods of smoothing the periodogram have therefore been
suggested by Bartlett (1950 and 1954), Grenander (1951), Grenander and Rosenblatt (1952 and
1954) and others (see also Bartlett and Medhi (1955)). Having estimated a spectrum we may
then consider a goodness of fit test of Kolmogoroff's type to see if it diverges significantly
from some estimated spectrum, such as might be obtained, for example, by fitting an
autoregressive scheme (Grenander and Rosenblatt, 1954). Unfortunately such tests may be easily
upset by the fact that the hypothetical spectrum has to be estimated (Kac, Kiefer and
Wolfowitz, 1955).
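The Fourier relationship between the periodogram and the sample serial covariances can be checked directly. The Python sketch below assumes a particular normalisation, which is a reading of the definitions rather than something confirmed by the source: the cosine and sine sums carry the factor sqrt(2/n), and the serial covariance of lag s is the sum of lagged products divided by n - |s|. All function names and data are hypothetical.

```python
import math


def periodogram_ordinate(x, p):
    """I_p = a_p**2 + b_p**2 for a series with known zero mean (assumed scaling)."""
    n = len(x)
    scale = math.sqrt(2.0 / n)
    a = scale * sum(v * math.cos(2.0 * math.pi * p * t / n) for t, v in enumerate(x, start=1))
    b = scale * sum(v * math.sin(2.0 * math.pi * p * t / n) for t, v in enumerate(x, start=1))
    return a * a + b * b


def serial_covariance(x, s):
    """Sum of lagged products divided by n - |s| (mean taken as zero)."""
    n = len(x)
    s = abs(s)
    return sum(x[t] * x[t + s] for t in range(n - s)) / (n - s)


def periodogram_from_covariances(x, p):
    """The same ordinate written as a weighted cosine sum of serial covariances."""
    n = len(x)
    return 2.0 * sum(
        (1.0 - abs(s) / n) * serial_covariance(x, s) * math.cos(2.0 * math.pi * p * s / n)
        for s in range(-(n - 1), n)
    )


x = [0.5, -1.0, 2.0, 0.0, -1.5, 1.0]    # made-up series
for p in (1, 2, 3):
    print(p, periodogram_ordinate(x, p), periodogram_from_covariances(x, p))
```

The two columns printed agree to rounding error, which is the sample counterpart of the Wold-Khintchine relationship.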


The above survey of methods of analysing a single variate process is somewhat 
sketchy but covers rather more than the econometrician is likely to use in practice as he 
will very rarely have a series of more than 50 terms and often only 20 or 30. He will clearly 
have to test such series for serial dependence and perhaps estimate a first or even second 
order autoregressive scheme but with series of such a length the calculation of a periodogram 
and the testing of the goodness of fit of a periodogram or correlogram would not seem to 
be very useful. The natural scientist, with long series to deal with and probably also a
physical theory of the processes producing such series, will find all the above techniques 
useful, especially since his prime concern may well be to understand the structure of the 
process. The econometrician, on the other hand, will be primarily concerned to determine 
how far his tests of significance and estimation procedures may be upset by the fact that he 


is not dealing with series of independent variates. Moreover his real concern is usually 


with the relationships between several series rather than the analysis of a single variate 
process.


This is perhaps a suitable point to make some remarks about the problem of prediction.
Clearly prediction is only possible if serial dependence exists and if we assume that the
process is Gaussian, so that any set of the $x_t$ is distributed in a multivariate normal
distribution, the optimum estimator will be a linear form in all the values of $x_t$ which have been
already observed. This linear form can be estimated by least squares and if the underlying
process is an autoregressive one, it will have coefficients equal to the coefficients of the
estimated autoregressive relation if the prediction is for one time interval ahead. Prediction
for larger intervals ahead can be obtained by repeating the prediction. The errors in such a
prediction are of two kinds: the error resulting from not estimating the coefficients in the
prediction formula exactly, and the error arising from the fact that the future of the process is
not uniquely determined by its past. (Formulae for these errors are given by Stone (1947).)
In most cases the latter error is so large that prediction from a single series would be rarely
worth while in economics where the serial correlations are usually just large enough to make
ordinary statistical methods inapplicable but not large enough to make prediction useful.
Where predictions may be useful in economics, these are in dealing with processes involving
more than one variable. The optimal linear predictor in the sense of least squares is then
estimated, as before, by least squares and the same type of theory is applied, only the details
of the calculations being more complicated. It should be emphasised that the fact that the
least squares predictor is the best one in the class of all linear predictors does not depend on any
assumptions about the underlying structure of the process. It will be optimal in the class of
all predictors if the process is a Gaussian one but it is quite possible that some economic
processes may be better represented by non-linear relationships and for these non-linear
predictors may be better. The theory of processes with non-linear structures is,
however, almost completely unexplored.
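For the simplest case, a zero-mean simple Markov process with a known coefficient, repeated prediction and the second kind of error can be written down in a few lines. The Python below is a sketch under those assumptions only: it ignores the error from estimating the coefficient, and the function names and numbers are illustrative.

```python
def predict_markov(last_value, rho, steps):
    """Repeat the one-step predictor rho * x to forecast several intervals ahead."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = rho * value
        forecasts.append(value)
    return forecasts


def prediction_error_variance(rho, innovation_variance, steps):
    """Variance of the forecast error caused by future random disturbances alone."""
    return innovation_variance * sum(rho ** (2 * j) for j in range(steps))


print(predict_markov(2.0, 0.5, 3))

# With rho = 0.4 the error variance is already close to the variance of the process.
process_variance = 1.0 / (1.0 - 0.4 ** 2)
print(prediction_error_variance(0.4, 1.0, 1) / process_variance)
print(prediction_error_variance(0.4, 1.0, 2) / process_variance)
```

The last two ratios show how little of the variance a forecast removes when the serial correlation is of the moderate size described in the text.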


A problem of considerable importance to which little attention in this connection
has been paid is that of superimposed error. Suppose that an observed process $\{x_t\}$ is
generated by the relations $x_t = y_t + z_t$, $y_t = \rho y_{t-1} + \epsilon_t$ where $z_t$ and $\epsilon_t$ are independent processes
of independent variates with zero means and standard deviations $\sigma_1$ and $\sigma_2$. Then $\{y_t\}$,
which is not observed, is a simple Markov process with zero mean and standard deviation
$\sigma_3$ where $\sigma_3^2 = \sigma_2^2(1 - \rho^2)^{-1}$. The process $x_t$ will have standard deviation $\sigma_4$, where
$\sigma_4^2 = \sigma_1^2 + \sigma_3^2$ and if we write $\lambda = \sigma_3^2\sigma_4^{-2} < 1$, its serial correlation coefficients will be
$\rho_s = \rho_{-s} = \lambda\rho^s$ ($s = 1, 2, \dots$). This is not a simple Markov process since, for example, $\rho_2$ is
not equal to $\rho_1^2$. The problem of testing whether $\sigma_1 = 0$ is not an easy one. If one can be
sure that the process is of the above form with $\sigma_1 = 0$ or $\sigma_1 > 0$ one could use the partial
serial correlation as a test criterion. However, this may well become large not because
$\sigma_1 > 0$ but because the process is really, say, generated by an autoregressive scheme of higher
order. The problem of an adequate test is not therefore solved. However, our troubles do not
end here. Suppose we wish to estimate $\rho$, $\sigma_1$ and $\sigma_2$. We now get an identification problem.
For if $\rho = 0$, $\sigma_1$ and $\sigma_3$ are individually not identifiable and we can only estimate $\sigma_1^2 + \sigma_3^2$.
Therefore any attempt to estimate $\sigma_1$ and $\sigma_3$ can succeed only if $\rho \neq 0$, which we can decide
only by a statistical test for serial correlation and this does not give a certain result. Thus
whether we should proceed or not with our attempt at estimation depends on a prior test,
a situation which raises a rather peculiar problem of statistical inference and may also lead
to considerable biasses. This situation is made all the worse by the short length of the series
with which we are usually concerned. This type of situation must frequently occur in
econometrics and deserves much deeper investigation.

The extent of the uncertainty in tests for correlation in small samples does not seem
to be widely realized. To illustrate this suppose we have sample series of simple Markov
processes of lengths 15 and 25 and that the true means of the processes are known, so that we
can use Jenkins's approximate distribution of $y = \sin^{-1} r_1$ (Jenkins, 1954). We then find
that the one sided five per cent points of the distribution of $\sin^{-1} r_1$ are 0.411 and 0.323
respectively, when $\rho = 0$. Let us now consider how large $\rho$ must be to give a 50% chance of
exceeding these values. To do this we find by interpolation the value of $\rho$ such that
$E(\sin^{-1} r_1)$ is equal to 0.411 and 0.323, and obtain 0.440 and 0.337. Thus if we have a series of 15
terms with a serial correlation coefficient less than 0.440, we have more than a 50% chance
of not judging the serial correlation significant. But a serial correlation 0.4 can have quite
a considerable effect on other tests or methods of estimation we may wish to apply later.
This is, of course, nothing other than the dangerous procedure of accepting a hypothesis,
because a test of significance on the data does not reject it. This is very common in
econometric work and appears to be unavoidable. An example occurs in Stone (1954). In dealing
with a large number of economic series each of 19 terms Stone suggests that the series should
be transformed by taking first differences, i.e., if the original series is $\{x_t\}$, he uses
$\{y_t\} = \{x_t - x_{t-1}\}$
and suggests, as a result of applying von Neumann's ratio, that $\{y_t\}$ can be approximately
regarded as a series of independent terms. This of course implies that the original process is
not a stationary one. However, since Stone's series are only of 19 terms there might well
be a non-negligible serial correlation in $\{y_t\}$, which does not show up in the test. Clearly
with series of this length the econometrician can only use the best methods available and
hope that the results will not be too far out, but the uncertainty in the procedure should at
least cause him to regard his final results with more uncertainty than their nominal
standard errors would suggest.


It would seem that many of the methods used in the analysis of economic time series
are not very “robust” with respect to serial dependence. This is a more serious matter than
their non-robustness with respect to the assumption of normality. If there are serious
doubts about the latter something can be done by using a transformation or by the use of
non-parametric methods; for example Wald and Wolfowitz have given a non-parametric test for
serial correlation based on the universe of all n! permutations of the observed values.
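The idea of a test referred to the universe of all orderings can be sketched by brute force for very short series. The Python below is an illustration only: the statistic (the circular sum of products of neighbouring values) is assumed here to be of the kind used in such a test, the exhaustive enumeration is practical only for a handful of terms, and the function names are hypothetical.

```python
from itertools import permutations


def circular_product_sum(x):
    """Sum of x_t * x_{t+1}, with the last term paired with the first."""
    n = len(x)
    return sum(x[t] * x[(t + 1) % n] for t in range(n))


def permutation_p_value(x):
    """Proportion of all n! orderings whose statistic is at least the observed one."""
    if len(x) > 8:
        raise ValueError("exhaustive enumeration is only practical for short series")
    observed = circular_product_sum(x)
    count = 0
    total = 0
    for ordering in permutations(x):
        total += 1
        if circular_product_sum(ordering) >= observed - 1e-12:
            count += 1
    return count / total


print(permutation_p_value([1, 2, 4, 3]))
print(permutation_p_value([1, 2, 3, 4]))
```

For series of realistic length the permutation distribution would have to be approximated, for example by sampling orderings at random, rather than enumerated.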


Examples in economic literature of serial correlation analysis applied to a single series 
are not common. Some are given in Davis (1941), Kendall (1946) and Kendall (1953) (stock 
prices). These are mostly quite long series obtained either by removing a trend with a mov- 
ing average from annual data as in the case of the Beveridge wheat series, or by using weekly 
values as in Kendall’s second paper. However, the econometrician is rarely faced with ana- 
lysing a long single series of such a type. His series are usually short and not taken in isola- 
tion but in relation to one or more other series. Much of the above theory is therefore not 
directly useful and has been included here, chiefly because it is a necessary preliminary to 
the joint analysis of two or more series to which we now turn.


3. MANY VARIABLE PROCESSES 


We now consider the problem of dealing with situations where at each time instant
$t$ we observe several variables $x_t, y_t, \dots$ and our approach will be determined by what we
decide to regard as variates and what as fixed variables, the distinction made in Section
1 between correlation problems and regression problems. Thus, to take the simplest case,
if we are given a series $\{x_t, y_t\}$ we may try to analyse this by regarding both $x_t$ and $y_t$ as random
variables or we may take the values of $x$ (say) as fixed and base our inferences on a model,
in which the probability reference set is the joint probability distribution of all the $y$'s, the
$x$'s remaining constant at their fixed observed values. Which of these procedures we adopt
depends on the kinds of question we want to answer and also on what we know, from
non-statistical evidence, of the nature of the model. Even if it is clear that all the variables are
best regarded as variates, it is often useful to base our statistical inferences on the conditional
probability distribution obtained by holding some of them fixed.


This distinction is related to another convenient distinction of variables into
“endogenous” and “exogenous”. This distinction is based on a prior knowledge of the structure
of the process. An exogenous variable is one which may influence the endogenous variables
but is not influenced by them. In any particular case this distinction is not an absolute
one but is relative to the question asked. Thus $x_t$, $y_t$ and $z_t$ might be taken as the
temperature, rainfall and the price of beer. Clearly $z_t$ cannot affect $x_t$ and $y_t$ but may be affected
by them. The social scientist will therefore take $x_t$ and $y_t$ as exogenous variables and $z_t$ as
endogenous, whilst the physical scientist might also be interested in the relationship between
$x_t$ and $y_t$. This distinction is therefore principally important in deciding on the model but
also in many cases it is the exogenous variables which are taken to be fixed variables.
Whether this is so or not in any prediction problem will depend on whether the prediction
involves knowing the future values of the exogenous variables or whether they, also, have
to be predicted.


Given, then, a multivariate series of observations we may ask what evidence there
is for any dependence between the variables and, if this is forthcoming, what we can say
about the kinds of model which may be supposed to have produced the series. Before we
can do this, we must consider the theory of mathematical models which could generate the
series. We take such models to be either stationary or such that their lack of stationarity
comes solely from the variation of the mean with some observed variable or variables,
usually time. This variation might be linear or not, but on the removal of its effect
the process will be supposed to be strictly stationary. The variates will therefore be taken
to have zero means and bounded second moments.
We begin with the case where all the variables are variates and if there are $p$ of them
we represent them by a random column vector $x_t$ with $p$ components and a transpose

$$x_t' = (x_t^{1}, \dots, x_t^{p}).$$

For any given $s$ ($= 0, \pm 1, \dots$) we define the serial covariance matrix to be

$$C_s = (C_s^{jk}) = E(x_t x_{t+s}'),$$

where $C_s^{jk} = E(x_t^{j} x_{t+s}^{k}) = C_{-s}^{kj}$.

We then have a theorem due to Cramér (1940) which generalises the Wold-Khintchine
theorem and states that the necessary and sufficient conditions that any given set of matrices
$(C_s)$ ($s = 0, \pm 1, \dots$) be the covariance matrices of a stationary $p$-variate process are that
there exist $p^2$ (possibly complex) functions $W_{jk}(\theta)$ which are defined and of bounded variation
for $-\pi \le \theta \le \pi$ and such that
a 1 [uem aas 0). 

Jk е. бю UE 

e 20 | 


g For all the cases with which we will be 


and non-decreasin, 
function 


For j— k, Wy (0) is rel 
riance generating 


concerned the matrix cova 
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ү” 1 çist and 
will be convergent їп ап annulus 1—8 < lz] < 14-908 > 0) and Иб) will exist ene 
be given by 

о 
(8) = ЕЭ Д 
-0 


Suppose we now define another vector process $\{y_t\}$ as a moving matrix average of the form

$$y_t = A_0 x_t + A_1 x_{t-1} + \ldots$$

where each $A_i$ is a $p \times p$ matrix, such that each of the $p^2$ series formed by the $(i, j)$ elements of the $A_i$ is an absolutely convergent series (much wider conditions are sufficient). Then if the matrix covariance generating function of the $\{y_t\}$ process is denoted by $S_y(z)$, we easily see that

$$S_y(z) = \Big(\sum_{m=0}^{\infty} A_m z^{-m}\Big)\, S(z)\, \Big(\sum_{m=0}^{\infty} A_m' z^{m}\Big).$$
m-0 ШР] 


We may define the vector analogue of the stationary autoregressive process in one variate by the equation

$$x_t + A_1 x_{t-1} + \ldots + A_k x_{t-k} = \eta_t \tag{3.1}$$

where the $A$'s are $p \times p$ matrices and $\eta_t$ is a column vector of variates whose variance-covariance matrix is $B = (b_{ij})$ and such that the $\eta_t$ for different values of $t$ are all independent. To ensure that such an equation will generate a stationary process it is necessary to impose some condition on the $A$'s and this is found to be that the roots of the equation

$$\Big|\, I + \sum_{i=1}^{k} A_i z^i \,\Big| = 0$$

must all lie outside the unit circle $|z| = 1$. When this is satisfied, the matrix covariance generating function of the $\{x_t\}$ process is found to be

$$\Big(I + \sum_{i=1}^{k} A_i z^{-i}\Big)^{-1} B\, \Big(I + \sum_{i=1}^{k} A_i' z^{i}\Big)^{-1}$$


since this now exists inside some annulus $1-\delta < |z| < 1+\delta$ $(\delta > 0)$.

If we now return to the matrix equation (3.1), we see that by applying suitable operators to the $p$ scalar equations, which it represents, it is possible to eliminate all the $x$'s except one. The resulting equation then looks like an ordinary autoregressive equation on the left hand side (of order at most $pk$) but on the right hand side we will have a moving average of the components of $\eta_t$. Thus in general an individual component, $x_t^j$, in a multivariate system defined by an equation like (3.1) is generated by a process which is not of autoregressive type. Notice also, that it is possible to generate intrinsically oscillatory processes by an equation of type (3.1) when $k = 1$, which is not the case for a scalar autoregressive model. For further developments in the theory of multiple processes see Whittle (1953).
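As a rough illustration of the stationarity condition and of the remark about oscillatory processes, the following Python sketch treats only the first-order bivariate case ($p = 2$, $k = 1$) of (3.1). For $k = 1$ the roots of $|I + A_1 z| = 0$ are $z = -1/\mu$, where $\mu$ runs over the non-zero eigenvalues of $A_1$, so the roots lie outside the unit circle exactly when every $|\mu| < 1$. The function names and the example matrix are illustrative assumptions and do not come from the text.

```python
import cmath


def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix [[a, b], [c, d]] from its characteristic quadratic."""
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + disc) / 2.0, (trace - disc) / 2.0


def is_stationary_first_order(a1):
    """True when all roots of |I + A1 z| = 0 lie outside |z| = 1."""
    return all(abs(mu) < 1.0 for mu in eigenvalues_2x2(a1))


def is_oscillatory_first_order(a1):
    """True when A1 has complex eigenvalues, so the process has damped oscillations."""
    return any(abs(mu.imag) > 1e-12 for mu in eigenvalues_2x2(a1))


# Hypothetical coefficient matrix for x_t + A1 x_{t-1} = eta_t.
A1 = [[-0.5, 0.4], [-0.4, -0.5]]
print(is_stationary_first_order(A1), is_oscillatory_first_order(A1))
```

A scalar first-order equation has a single real coefficient, so the second check can never succeed there, which is the contrast drawn in the paragraph above.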
Mann and Wald (1943) have considered the estimation problem for processes of the above kind. In order to include estimation of the means we write the process in the more general form

$$x_t + A_1 x_{t-1} + \ldots + A_k x_{t-k} + a = \eta_t$$

where $x_t$ no longer has zero mean. The estimation procedure now depends essentially on the assumption that the $\eta_t$ are serially independent vectors, and in fitting a system of type (3.1) to a set of observations the choice of $k$ such that the resulting estimated $\eta_t$ can be regarded as serially independent is a matter of some difficulty which deserves further investigation. However, we suppose $k$ known and then distinguish between two cases (the first two cases of Mann and Wald).

In the first we make the further assumption that the elements of $\eta_t$ have a diagonal variance-covariance matrix which is, of course, unknown. If we were to assume further that the elements of $\eta_t$ are normally distributed, the maximum likelihood estimators of the $A$'s and $a$ are found by minimising the sum of squares $\sum_t \eta_t' \eta_t$ and therefore coincide with the least square estimates. However even if the $\eta_t$ are not normally distributed (but have finite moments), these estimators will be consistent and asymptotically normally distributed with variances and covariances which can be estimated.

Two remarks might be made here which may help the reading of Mann and Wald's paper. The minimisation of the sum of squares $\sum_t \eta_t' \eta_t$ is equivalent to minimising the sums of squares of the disturbances in the individual equations. This is no longer true when the variance-covariance matrix is not a diagonal matrix. Moreover it must be pointed out that the maximisation of the likelihood of the observed series is carried out conditionally. The probability distribution is that of the $x_t$ for $t \ge 1$ with the values for $t = 0, -1, \ldots, 1-k$ kept fixed. By doing this we avoid the difficulties connected with the Jacobian.

If we assume the variance-covariance matrix of $\eta_t$ is an arbitrary variance-covariance matrix, we may, assuming an unknown multivariate normal distribution for the components of each $\eta_t$, find maximum likelihood estimators in the same kind of way and these are again equal to least squares estimators. Their joint limiting distribution, even if the distribution of $\eta_t$ is not a multivariate normal, is again a multivariate normal. In both this and the previous case these estimators are asymptotically efficient.
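The least squares step can be sketched for the simplest case, a first-order system $x_t + A_1 x_{t-1} = \eta_t$ with zero means and two components, conditioning on the first observation as described above. Minimising $\sum_t \eta_t'\eta_t$ gives $A_1 = -S_{01} S_{11}^{-1}$ with $S_{01} = \sum x_t x_{t-1}'$ and $S_{11} = \sum x_{t-1} x_{t-1}'$. The helper names and the noise-free test series are assumptions made for illustration; this is not Mann and Wald's own computation.

```python
def fit_first_order(xs):
    """Least squares estimate of A1 in x_t + A1 x_{t-1} = eta_t (two components).

    xs is a list of (x1, x2) observations; the first one is treated as fixed.
    """
    s01 = [[0.0, 0.0], [0.0, 0.0]]  # sum of x_t x_{t-1}'
    s11 = [[0.0, 0.0], [0.0, 0.0]]  # sum of x_{t-1} x_{t-1}'
    for prev, cur in zip(xs[:-1], xs[1:]):
        for i in range(2):
            for j in range(2):
                s01[i][j] += cur[i] * prev[j]
                s11[i][j] += prev[i] * prev[j]
    det = s11[0][0] * s11[1][1] - s11[0][1] * s11[1][0]
    inv = [[s11[1][1] / det, -s11[0][1] / det],
           [-s11[1][0] / det, s11[0][0] / det]]
    # Regressing x_t on x_{t-1} gives S01 S11^{-1}; A1 is its negative.
    return [[-sum(s01[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def simulate_noise_free(a1, x0, steps):
    """Generate x_t = -A1 x_{t-1} with eta_t = 0 (a deterministic check series)."""
    xs = [x0]
    for _ in range(steps):
        p = xs[-1]
        xs.append((-(a1[0][0] * p[0] + a1[0][1] * p[1]),
                   -(a1[1][0] * p[0] + a1[1][1] * p[1])))
    return xs


TRUE_A1 = [[-0.5, 0.4], [-0.4, -0.5]]  # hypothetical coefficients
estimate = fit_first_order(simulate_noise_free(TRUE_A1, (1.0, 0.0), 6))
print(estimate)
```

With disturbances present the same formula applies; the noise-free series is used here only so that the recovered matrix can be compared directly with the one used to generate the data.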

In practice the situation is usually more complicated and we have a system of the form

$$A_0 x_t + A_1 x_{t-1} + \ldots + A_k x_{t-k} + B_0 z_t + \ldots + B_l z_{t-l} = \eta_t \tag{3.2}$$

where the $z_t$ are vectors of exogenous variates which are regarded as fixed. (This is slightly more general than Mann and Wald's third case.) By putting one of the components of $z_t$ equal to unity, this includes the case considered above where we have to estimate a mean.

We suppose that $A_0$ is unknown, but since the process is to be generated sequentially in time by the equation (3.2), we must take $A_0$ to be non-singular. Premultiplying by $A_0^{-1}$ we get the "reduced" form

$$x_t + A_0^{-1}A_1 x_{t-1} + \ldots + A_0^{-1}A_k x_{t-k} + A_0^{-1}B_0 z_t + \ldots + A_0^{-1}B_l z_{t-l} = A_0^{-1}\eta_t \tag{3.3}$$

and we can apply the theory of Mann and Wald's second case to estimate $A_0^{-1}A_1, \ldots, A_0^{-1}A_k, A_0^{-1}B_0, \ldots, A_0^{-1}B_l$. The problem now is to return from (3.3) to (3.2).
Clearly this cannot be done without further assumptions. We can write our estimate of equation (3.3) in the form

$$x_t = C\omega_t + \epsilon_t \tag{3.4}$$

where $\omega_t$ is a column vector of all the "predetermined" variables in $x_{t-1}, \ldots, x_{t-k}, z_t, \ldots, z_{t-l}$ arranged in a column, $C$ is an $n \times (k+l)$ matrix which has been estimated, and $\epsilon_t = A_0^{-1}\eta_t$. From (3.4) we wish to return to

$$A_0 x_t = D\omega_t + \eta_t \tag{3.5}$$

where $D = A_0 C$. Consider the $i$-th row of $A_0$ and call it $a_i$. The corresponding row of $D$ is $d_i = a_i C$. If we impose the condition that $n-1$ of the components of $d_i$ are zero ($n$ is the number of components in $a_i$), we have $n-1$ equations $a_i C_j = 0$ where $j$ takes $n-1$ values and $C_j$ is the $j$-th column of $C$. Then $a_i$ will be determined except for a constant numerical factor. It follows that if in each equation of the original system (3.2) we can prescribe exactly $n-1$ predetermined variables which do not occur in that equation we have complete identification of $A_0$ (except for multiplication by a diagonal matrix) and so we can use least squares methods to solve the problem. This can also be done if the restrictions consist of $n-1$ more general linear relations.
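The identification argument can be illustrated for $n = 2$, where each row $a_i$ needs a single exclusion restriction $a_i C_j = 0$. The sketch below builds a reduced-form matrix $C = A_0^{-1}D$ from hypothetical structural matrices, then recovers each row of $A_0$ from the column of $C$ belonging to the excluded predetermined variable. Fixing the arbitrary constant factor by setting the diagonal element of each row to unity is an assumption made here for the example, as are all names and numbers.

```python
def row_from_exclusion(c_column, normalise_index):
    """Row a with a . c_column = 0, scaled so that a[normalise_index] = 1."""
    a = (c_column[1], -c_column[0])  # any vector orthogonal to c_column
    scale = a[normalise_index]
    return (a[0] / scale, a[1] / scale)


# Hypothetical structural form: row 1 excludes the 2nd predetermined variable,
# row 2 excludes the 1st, so D has zeros off the diagonal.
A0 = [[1.0, 0.5], [-0.3, 1.0]]
D = [[2.0, 0.0], [0.0, 3.0]]

det = A0[0][0] * A0[1][1] - A0[0][1] * A0[1][0]
A0_inv = [[A0[1][1] / det, -A0[0][1] / det],
          [-A0[1][0] / det, A0[0][0] / det]]
# Reduced-form coefficients C = A0^{-1} D (what would be estimated in practice).
C = [[sum(A0_inv[i][k] * D[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

row1 = row_from_exclusion((C[0][1], C[1][1]), 0)  # uses column 2 of C
row2 = row_from_exclusion((C[0][0], C[1][0]), 1)  # uses column 1 of C
print(row1, row2)
```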


If there are more than $n-1$ linear restrictions on some or all of the rows of $D$, the system is "over-identified" and there is no non-zero solution by the above method. It is then necessary to maximise the likelihood of the whole system under these constraints and this leads to much more complicated calculations. If one equation is just identified and some or all of the other equations are over-identified we could proceed as before for the estimation of this single equation but this, which is known as the "limited information" method, is not fully efficient since the extra restrictions on the other equations provide some more information.

A clear account of the issues involved in this problem is given in pp. 292-296 of Stone (1954) and a detailed analysis in Koopmans (1950) (see also Koopmans and Hood (1953)).

The restrictions imposed in order to make equation (3.2) identifiable are usually linear. It may however occur that we have information only on the signs of some of the coefficients. In this case least squares methods would be very difficult to apply and a method of minimisation by using an analogue electronic machine may prove the best method.


Turning away from the general problem of multivariate processes let us consider the simplest problems concerning the relationship between two observed series and ask first how correlation between such series can be tested. The fallacy of ascribing a direct causal connection between variates whose observed correlation is due solely to a common factor is well-known and when this common factor is time, so that both the series of values have a trend, it is unlikely to deceive any statistician. This is in fact the basis of Yule's (1926) first example of a "nonsense correlation" mentioned before. However, even if both series are trend free, the ordinary correlation coefficient test of the hypothesis that the series are uncorrelated cannot usually be applied owing to the lack of serial independence, and this fact still seems to be widely unrecognised amongst economists (an example of earlier date is Beveridge (1944) p.410). Suppose we have two stationary series $\{x_n\}$, $\{y_n\}$ with serial correlations $\rho_s$ and $\rho'_s$. Then it is easy to show that asymptotically the variance of the product moment correlation coefficient between them based on $n$ pairs of observations is approximately (Bartlett (1935)),

$$(n-1)^{-1}\Big\{1 + 2\sum_{s=1}^{n-1}\Big(1 - \frac{s}{n}\Big)\rho_s\rho'_s\Big\}. \tag{3.6}$$

In most economic applications the second factor in this formula will be substantially larger than unity and the power of the test correspondingly reduced. The exact distribution of $r$ in the present case is not known; but presumably a good approximation would be to refer it to tables of the ordinary distribution based on

$$1 + (n-1)\Big\{1 + 2\sum_{s=1}^{n-1}\Big(1 - \frac{s}{n}\Big)\rho_s\rho'_s\Big\}^{-1} \tag{3.7}$$

pairs of observations. The difficulty is that we do not know $\rho_s$ and $\rho'_s$. We cannot simply substitute estimates $r_s$, $r'_s$ calculated from the observed series, for it can be shown that the standard error of sampling of the second factor is comparable to its mean. The best thing to do is to fit an autoregressive series of low order (if this can be done) and calculate the higher order serial correlation coefficients from the coefficients of this relation. If both processes are simple Markovian, we have $\rho_s = \rho_1^s$ and $\rho'_s = (\rho'_1)^s$, so that

$$1 + 2\sum_{s=1}^{n-1}\Big(1 - \frac{s}{n}\Big)\rho_s\rho'_s = \frac{1 + \rho_1\rho'_1}{1 - \rho_1\rho'_1} - \frac{2\rho_1\rho'_1\{1 - (\rho_1\rho'_1)^n\}}{n(1 - \rho_1\rho'_1)^2}$$

which is asymptotically equal to

$$\frac{1 + \rho_1\rho'_1}{1 - \rho_1\rho'_1}.$$

We could then substitute estimates $r_1$ and $r'_1$ and proceed as above. The snag about this is that $r_1$ and $r'_1$ are usually strongly biassed in small samples, as pointed out before, and we may substantially underestimate the variance of $r$. However with this precaution the method seems to work fairly well in practice.
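The correction factor in (3.6) and the equivalent number of pairs (3.7) are easy to evaluate numerically. The Python sketch below computes the factor both by direct summation and by the closed form for two simple Markov series, and then converts it to an equivalent number of pairs. The function names and the example correlations 0.6 and 0.5 are illustrative assumptions.

```python
def inflation_factor(rho_x, rho_y, n):
    """1 + 2 * sum_{s=1}^{n-1} (1 - s/n) rho_s rho'_s by direct summation.

    rho_x[s - 1] and rho_y[s - 1] hold the serial correlations of lag s.
    """
    return 1.0 + 2.0 * sum((1.0 - s / n) * rho_x[s - 1] * rho_y[s - 1]
                           for s in range(1, n))


def markov_inflation_factor(rho1, rho1_prime, n):
    """Closed form of the same factor when rho_s = rho1**s and rho'_s = rho1_prime**s."""
    x = rho1 * rho1_prime
    return (1.0 + x) / (1.0 - x) - 2.0 * x * (1.0 - x ** n) / (n * (1.0 - x) ** 2)


def equivalent_pairs(factor, n):
    """Equivalent number of independent pairs, as in (3.7)."""
    return 1.0 + (n - 1) / factor


n = 20
rho_x = [0.6 ** s for s in range(1, n)]
rho_y = [0.5 ** s for s in range(1, n)]
direct = inflation_factor(rho_x, rho_y, n)
closed = markov_inflation_factor(0.6, 0.5, n)
print(direct, closed, equivalent_pairs(closed, n))
```

Even with moderate serial correlations of this size, the equivalent number of pairs is well below the nominal twenty.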

A better method is due to Quenouille (1949). If we have two Markovian processes $\{x_n\}$, $\{y_n\}$ we calculate the partial correlation coefficient between $x_t$ and $y_t$ with the effects of $x_{t-1}$ and $y_{t-1}$ removed. Hannan (1955) has shown that this procedure results in an asymptotically most efficient test. Quenouille also suggested using a partial correlation between $x_t$ and $y_t$ with the effect of only $x_{t-1}$ (or $y_{t-1}$) removed. Whilst this is a valid procedure, it does not provide an asymptotically most powerful test. Hannan also shows that the use of the crude correlation coefficient with the "degrees of freedom" corrected by (3.7) is also inefficient.

The alternative hypothesis envisaged here is that the series are generated by equations of the form $x_n = \rho_1 x_{n-1} + \epsilon_n$, $y_n = \rho_2 y_{n-1} + \eta_n$ with $\epsilon_n$ and $\eta_n$ correlated but jointly serially independent for different $n$. This test depends essentially on the Markovian character of the series, but if either or both are higher order autoregressive series, a similar method could, presumably, be applied. Quenouille checked his suggestions by a sampling experiment but did not obtain any analytic results.
If the alternative hypothesis was that $y_n = \alpha x_n + u_n$, where $\{x_n\}$ and $\{u_n\}$ are independent Markov processes with serial correlations $\rho_1$ and $\rho_2$, Quenouille's test is not fully efficient unless $\rho_1 = \rho_2$. It will be noted that $\{y_n\}$ is not then a Markov process unless this condition is satisfied.

A quite different approach due to Hannan (1955) makes use of the idea of Ogawara's test. Suppose $\{x_t\}$ and $\{y_t\}$ are simple autoregressive processes, $\{y_t\}$ is Markovian and we have a sample of $2n+1$ successive pairs $(x_t, y_t)$. Then we calculate the partial correlation coefficient between $y_{2t}$ and $x_{2t}$ $(t = 1, \ldots, n)$ with the effects of $(y_{2t-1} + y_{2t+1})$, $x_{2t-1}$ and $x_{2t+1}$ removed. If the series are normally distributed, this partial correlation coefficient is distributed exactly as an ordinary correlation coefficient based on $n-3$ pairs of observations. The interest of this test mainly lies in the fact that its distribution is exactly known and does not require a knowledge of the serial correlations of the processes. Hannan (1955) has given a detailed discussion of the asymptotic power of this and the previously considered tests for various alternative hypotheses. From this it is clear that in most cases Quenouille's method is more efficient than Hannan's test, one exception being the case where $\{x_t\}$ and $\{y_t\}$ are correlated as a result of their residuals being correlated, and $\{x_t\}$ is a second order autoregressive process whose first partial serial correlation is large and positive, whilst in most cases Hannan's test is more efficient than the use of $r$ and (3.7).


Consider now what is likely to happen in practice. Suppose we are given two series $\{x_n\}$, $\{y_n\}$ which, we assume, are either serially independent or generated by simple Markovian schemes. We wish to test whether the two series are independent or not and therefore have to decide whether both are serially correlated. To illustrate we suppose the true means known and the series to consist of $n = 15$ or 25 terms. Then we have already seen that with $\rho_1 = 0.440$ $(n = 15)$, $\rho_1 = 0.337$ $(n = 25)$ we only have a 50% chance of finding $r_1$ significant at the 5% level. Suppose we always decide that a series is serially dependent if $r_1$ does reach the 5% level. Then, if the two series are in fact independent, we have a 75% chance of deciding that at least one of the two series is serially independent and that we can therefore use the ordinary correlation coefficient between the series, for which we therefore take as 5% significance level $\pm 0.497$ $(n = 15)$ or $\pm 0.389$ $(n = 25)$. In fact however the true serial correlation is in both series $\rho_1 = \rho'_1$ and the correct approximate number of degrees of freedom is

$$n\Big\{1 + 2\sum_{s=1}^{n-1}\Big(1 - \frac{s}{n}\Big)\rho_1^{2s}\Big\}^{-1} - 1$$

(since the true means are known) which for $n = 15$ and 25 turns out to be 9.41 and 19.08 so that our true 5% levels are approximately 0.591 $(n = 15)$, 0.432 $(n = 25)$. This shows the danger of the procedure here adopted. It would appear that it is better to use Quenouille's method, whenever there is any likelihood at all that the processes may not have been serially independent, provided we can be sure of the order of the autoregression. The efficiency of Quenouille's method in detecting non-independence is almost certainly asymptotically unity and for small samples probably throws away about $n^{-1}$ of the information.

Finally attention should be drawn to the rather curious fact that in some cases it may be preferable and scientifically plausible to draw deductions about the correlation between series without using any statistical method. Thus to take an extreme case suppose we have recorded, by some physical instrument such as an oscillograph, two series which are, as far as the eye can judge, simple sine waves of slightly different frequency. Then one can be sure that they are not directly causally connected or correlated. An actual example similar to this is the question whether the observed cycle in the population of the Canadian lynx (a very definite cycle of about 10 years period) is related to the sunspot cycle (Moran, 1949). A statistical test of this hypothesis, e.g. by using Quenouille's method, would require a good deal of computation. However simple inspection of the data clearly shows that there can be no such dependence, for both cycles are very regular and are sometimes in phase and sometimes out of phase. If the dependence were so strong that the cycle in the lynx was caused by that in the sunspots, this could not happen. On the other hand it is sometimes possible to decide that there is some common cause at work in two series by simple inspection, even though an exact analysis might not even show a significant result. Thus if one compares the production of fox, Canadian lynx and Snowshoe rabbit furs from 1848-1908 given in Brouillette (1934), p. 168, it seems quite clear that some common factor is at work since the main peaks in these series all occur about the same time. Needless to say, such a judgement is best made only by a statistician with experience of the misleading inferences which can be drawn from serially correlated series.
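The degrees-of-freedom figures quoted earlier for the two Markovian series (9.41 for $n = 15$ and 19.08 for $n = 25$) can be checked numerically. The sketch below assumes the expression reads $n\{1 + 2\sum_{s=1}^{n-1}(1 - s/n)\rho_1^{2s}\}^{-1} - 1$, with both series sharing the same first serial correlation $\rho_1$ and the true means known; this reading is an inference from the printed figures, and the function name is hypothetical.

```python
def approx_degrees_of_freedom(rho1, n):
    """n / {1 + 2 * sum_{s=1}^{n-1} (1 - s/n) rho1**(2s)} - 1 (true means known)."""
    factor = 1.0 + 2.0 * sum((1.0 - s / n) * rho1 ** (2 * s) for s in range(1, n))
    return n / factor - 1.0


df_15 = approx_degrees_of_freedom(0.440, 15)
df_25 = approx_degrees_of_freedom(0.337, 25)
print(round(df_15, 2), round(df_25, 2))
```

Under this reading the computed values agree with the quoted ones to within rounding, against the 14 and 24 degrees of freedom that would apply to serially independent series with known means.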
We now turn to the case where regression rather than correlation is the appropriate model. We suppose that $\{y_n\}$ is the series whose regression on $\{x_n\}$ is to be examined. If we assume that $y_n = \alpha + \beta x_n + \epsilon_n$ where $\{\epsilon_n\}$ is another (unobserved) process, the first thing we want to do is to test whether $\{\epsilon_n\}$ is serially correlated or not, since our method of analysis in the two cases will be different. The natural thing to do is to calculate the serial correlation coefficient of the residuals and use it as a test criterion. For the case of regression on a single variable series $\{x_n\}$ the mean and variance depend on the values of the $x$'s (Moran, 1950). The exact distribution was investigated by Durbin and Watson (1950 & 1951). They gave two exact bounds for the significance levels, the uncertainty being due to the dependence of the distribution on the values of the $x$. In a later paper Hannan (1957) shows that when the regressor variables are a finite polynomial, the true significance level is very near the upper bound given by Durbin and Watson (the error being of order n instead of n2). On the other hand if the regressor variables are trigonometric polynomials, an exact solution is known (R. L. Anderson & T. W. Anderson, 1950).

If the serial correlations of the residual process were known, the best linear unbiassed estimator for $\beta$ (and similarly for a multiple regression) could be found by minimising the quadratic form

$$(y - X\beta)'\,\Gamma^{-1}(y - X\beta)$$

where $y'$ is the row vector $(y_1, \ldots, y_n)$, $\beta$ is the column vector of regression coefficients and $X$ is the matrix of regressor variables. $\Gamma$ is the variance-covariance matrix of the residuals. However $\Gamma$ is usually not known. It is well-known that least squares is not an efficient procedure for estimating the regression coefficients. Watson (1955) and Watson and Hannan (1956) have shown that even a small error in the prescription of $\Gamma$ can lead to considerable inefficiency in the estimation of $\beta$ and that in general the use of straightforward least squares ($\Gamma = I$) is not an efficient procedure unless $\{\epsilon_n\}$ is serially uncorrelated. However if the regressor variables are functions of time (e.g., polynomials), the least squares estimator is asymptotically efficient. This is just the case where the investigation of serial correlation is easiest. The problem of serial correlation in regression analysis has been treated from the point of view of spectral theory by Grenander (1954).
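The first step described for the regression case, calculating the serial correlation coefficient of the least squares residuals, can be sketched as follows for a single regressor. The non-circular definition $r_1 = \sum e_t e_{t+1} / \sum e_t^2$ is one common choice and is an assumption here, as are the function names and the artificial data; the bounds of Durbin and Watson are not reproduced.

```python
def ols_residuals(x, y):
    """Residuals from the least squares fit of y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return [b - alpha - beta * a for a, b in zip(x, y)]


def first_serial_correlation(e):
    """First-order serial correlation of residuals (which have zero mean)."""
    numerator = sum(e[t] * e[t + 1] for t in range(len(e) - 1))
    denominator = sum(v * v for v in e)
    return numerator / denominator


# Artificial data: a linear trend plus a disturbance that alternates in sign,
# so the residuals should show strong negative serial correlation.
x = list(range(10))
y = [2.0 + 3.0 * v + (1.0 if v % 2 == 0 else -1.0) for v in x]
r1 = first_serial_correlation(ols_residuals(x, y))
print(r1)
```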


Another approach to the problem of testing serial correlation in the residuals of a regression is based on Ogawara's idea (Hannan, 1955). If we have a regression of $\{y_n\}$ on $\{x_n\}$ such that $y_n = \alpha + \beta x_n + \epsilon_n$ and wish to test whether $\{\epsilon_n\}$ is serially independent against the alternative that it is generated by a simple Markov scheme, we test the partial correlation of the $\{y_{2n}\}$ with $\{y_{2n-1} + y_{2n+1}\}$ when the effects of $\{x_{2n}\}$ and $\{x_{2n-1} + x_{2n+1}\}$ are removed. This is an asymptotically most efficient test.

This completes what might be reasonably written about those aspects of the theory of stationary random processes which are of direct interest to the economist. Needless to say, a great deal of other work on these processes has been published, particularly for processes in which time is taken as a continuous variable, but much of this work is mainly of interest in applications to other sciences.

There are, however, a number of economic problems which involve random processes of special types. Such problems are either concerned with particular economic problems, as for example in Rutherford's (1955) stochastic model to explain income distributions (see also Bernadelli (1944)) or else are concerned with economic planning of particular industrial operations. The latter is a rapidly growing field with a closer relationship to "Operational Research" than to pure economic theory. As examples we may instance the problem of inventory policies (for a survey see Ackoff (1956)), the programming of hydroelectric systems (Massé (1946)), and the planning of transportation (Beckmann, McGuire and Winsten (1956)).
EXPRESSIONS FOR THE LOWER BOUND TO CONFIDENCE
COEFFICIENTS


By SAIBAL KUMAR BANERJEE
Indian Statistical Institute, Calcutta


SUMMARY. A lower bound to the probability of sample estimate plus (and minus) $t$-times $(t > 1)$ estimate of sampling error covering the population mean (or total) is derived for samples from non-normal populations. Extensions of the result to the case of ratio estimates and multistage designs are also considered.


1. INTRODUCTION


Estimate of sampling variance of estimate indicates variability of the estimate and if the parent population is normal, sample estimate $\bar z$, plus (and minus) $t$-times estimate of sampling error covers population mean $m$ in $\alpha$ per cent of the cases, where $\alpha$ is fixed by $t$ and the sample size.

For a sample from a non-normal population if $\hat m$ is an estimate of $m$ and $\hat V(\hat m)$ (computed from sample readings) an estimate of sampling variance of $\hat m$, an expression for the lower bound to the probability that $\hat m \pm t\sqrt{\hat V(\hat m)}$ covers $m$ may be of some interest. In a paper (Banerjee, 1956) it was shown that if $z_1, z_2, \ldots, z_n$ be a sample of size $n$ drawn at random with replacement from a population with mean $m$ and a $\beta_2$-coefficient $B_2$, then, for $t > 1$,

$$\text{prob.}\left[\,|\bar z - m| \le t\sqrt{\frac{\sum (z - \bar z)^2}{n(n-1)}}\,\right] \ge \frac{1}{\dfrac{B_2 - 3}{n} + \dfrac{2}{(t^2-1)^2}\left\{\dfrac{t^4}{n-1} + 1\right\} + 1}.$$

An extension of the result to the case of pps sampling in stratified multistage design may be of some interest. Some of the extensions are indicated below which are all based upon a simple lemma.

It is seen that the role of $B_2$ in sampling with equal probability is taken over by a similar parameter in sampling with unequal probability. Stratified design introduces further parameters. The case of ratio estimate (simple ratio for single stratum, and combined ratio for strata) is also touched upon without assuming bivariate normal distribution of $y$ and $x$. A probability inequality in $R$ (= population ratio to be estimated) of the same form as Fieller's inequality is derived. An extension to multistage design is also indicated.
1.1. Lemma: Let $\phi(x_1, x_2, \ldots, x_p)$ be a function of $p$ stochastic variates such that $E\{\phi\} \ge 0$ and $E(\phi^2)$ exists. Then

$$\text{prob.}\{\phi \ge 0\} \ge \frac{[E\{\phi\}]^2}{E\{\phi^2\}}. \tag{1.1.1}$$

1.2. Proof: Let a variate $\psi$ be defined as

$$\psi = 1, \text{ if } \phi \ge 0; \qquad \psi = 0, \text{ if } \phi < 0.$$

Obviously $\phi\psi \ge \phi$,

therefore, $E\{\phi\psi\} \ge E\{\phi\}$,

or, $[E\{\phi\psi\}]^2 \ge [E\{\phi\}]^2$, ($\because E(\phi) \ge 0$)

or, $E(\phi^2)\,E(\psi^2) \ge [E\{\phi\psi\}]^2 \ge [E\{\phi\}]^2$ (Schwarz inequality)

or, $\text{prob.}\{\phi \ge 0\} = E(\psi^2) \ge \dfrac{[E\{\phi\}]^2}{E\{\phi^2\}}$.
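The lemma can be checked exactly on a small discrete distribution. The sketch below uses Python's `fractions` module so that no rounding enters; the three-point distribution is a made-up example chosen so that $E(\phi) = 1 \ge 0$.

```python
from fractions import Fraction


def lemma_sides(dist):
    """Return (prob{phi >= 0}, E(phi)^2 / E(phi^2)) for a discrete distribution.

    dist maps each possible value of phi to its probability.
    """
    mean = sum(value * prob for value, prob in dist.items())
    second_moment = sum(value * value * prob for value, prob in dist.items())
    prob_nonnegative = sum(prob for value, prob in dist.items() if value >= 0)
    return prob_nonnegative, mean * mean / second_moment


# Hypothetical distribution: phi = -1, 1, 3 with probabilities 1/4, 1/2, 1/4.
dist = {
    Fraction(-1): Fraction(1, 4),
    Fraction(1): Fraction(1, 2),
    Fraction(3): Fraction(1, 4),
}
probability, bound = lemma_sides(dist)
print(probability, bound)  # prints 3/4 1/3
```

Here the probability 3/4 comfortably exceeds the bound 1/3, which illustrates that the lemma gives a guaranteed lower bound and not an approximation to the probability.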


2. ONE STRATUM PPS SELECTION 


2.1. Let there be a finite population consisting of $N$ units. Let $y_i$ denote variate value $y$ of the $i$-th unit. Let $n$ units be selected with replacement from $N$ units, with probability proportional to some measure of the units. Let $p_1, p_2, \ldots, p_N$ denote the probability for the different units to be selected in a particular draw.


2.2. Let $z_s$ be an estimate of population total $M$ as built up from the $s$-th selected unit. Obviously

$$z_s = \frac{y_i}{p_i}$$

if the $s$-th selected unit happens to be the $i$-th unit of the population.


2.3. Let us define a function $L$ of estimators $z_1, z_2, \ldots, z_n$ and $M$ and $t$ $(t > 1)$ as

$$L = \frac{t^2\Lambda}{n} - (\bar z - M)^2 \tag{2.3.1}$$

where $\Lambda = \dfrac{\sum_{s=1}^{n}(z_s - \bar z)^2}{n-1}$ and $\bar z = \dfrac{\sum_{s=1}^{n} z_s}{n}$.
n—1l n 
2.4. It can be easily shown that

$$E(\Lambda) = \lambda, \qquad E\{(\bar z - M)^2\} = \frac{\lambda}{n},$$

$$E(\Lambda^2) = \lambda^2\left[\frac{B_2 - 3}{n} + 1 + \frac{2}{n-1}\right],$$

$$E\{\Lambda(\bar z - M)^2\} = \lambda^2\left[\frac{B_2 - 3}{n^2} + \frac{1}{n}\right], \tag{2.4.1}$$

$$E\{(\bar z - M)^4\} = \lambda^2\left[\frac{B_2 - 3}{n^3} + \frac{3}{n^2}\right],$$

where

$$\lambda = \sum_{i=1}^{N} p_i\left(\frac{y_i}{p_i} - M\right)^2 \quad\text{and}\quad B_2 = \frac{\sum_{i=1}^{N} p_i\left(\dfrac{y_i}{p_i} - M\right)^4}{\lambda^2}.$$

We have accordingly

$$E(L) = \frac{(t^2 - 1)\lambda}{n}, \qquad E(L^2) = \frac{(t^2-1)^2\lambda^2}{n^2}\left[\frac{B_2 - 3}{n} + \frac{2}{(t^2-1)^2}\left\{\frac{t^4}{n-1} + 1\right\} + 1\right]. \tag{2.4.2}$$

From (1.1.1), (2.3.1) and (2.4.2),

$$\text{prob.}\{L \ge 0\} \ge \frac{1}{\dfrac{B_2 - 3}{n} + \dfrac{2}{(t^2-1)^2}\left\{\dfrac{t^4}{n-1} + 1\right\} + 1}$$

or,

$$\text{prob.}\left\{\bar z + t\sqrt{\frac{\Lambda}{n}} \ge M \ge \bar z - t\sqrt{\frac{\Lambda}{n}}\right\} \ge \frac{1}{\dfrac{B_2 - 3}{n} + \dfrac{2}{(t^2-1)^2}\left\{\dfrac{t^4}{n-1} + 1\right\} + 1}. \tag{2.4.3}$$


2.5. Table 1 below gives numerical values of the lower bound to the probability of $\bar z + t\sqrt{\Lambda/n} \ge M \ge \bar z - t\sqrt{\Lambda/n}$ for $t = 3$ and sample size $n$ = 4, 6, 8, 10, 12, 20, 30, 50, 100 for different $B_2$-values.


TABLE 1. LOWER BOUND OF PROBABILITY OF THE INEQUALITY $\bar z + t\sqrt{\Lambda/n} \ge M \ge \bar z - t\sqrt{\Lambda/n}$
(values worked out from (2.4.3) taking t = 3)

B2-value                          sample size (n)
              4      6      8     10     12     20     30     50    100
  (1)        (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)   (10)

  1.0      0.727  0.830  0.875  0.899  0.914  0.939  0.951  0.959  0.964
  2.0      0.615  0.729  0.789  0.825  0.849  0.897  0.921  0.941  0.955
  3.0      0.533  0.650  0.718  0.762  0.793  0.859  0.894  0.923  0.946
  4.0      0.471  0.587  0.659  0.708  0.744  0.823  0.868  0.907  0.937
  5.0      0.421  0.535  0.609  0.661  0.700  0.791  0.844  0.891  0.929


From Table 1 it is seen that if $B_2$ be something like 4.0 (or less), working with three times the sampling error, $\bar z \pm 3\sqrt{\Lambda/n}$ will cover the true value in about 70.8 per cent of the cases (or more) if $n = 10$. If, however, $B_2$ be 3.0 or less (in pps sampling $B_2$ is likely to be small in general), again with three times the sampling error, $\bar z \pm 3\sqrt{\Lambda/n}$ will cover the true value in about 76.2 per cent of the cases or more.
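The entries of Table 1 can be regenerated from the lower bound for one stratum. The sketch below assumes the bound has the form $1/\{(B_2-3)/n + \frac{2}{(t^2-1)^2}(t^4/(n-1) + 1) + 1\}$, which is the reading of (2.4.3) that reproduces the legible entries of the table; the function name is hypothetical.

```python
def lower_bound(b2, n, t=3.0):
    """Lower bound to prob{ z_bar - t*sqrt(L/n) <= M <= z_bar + t*sqrt(L/n) }."""
    spread = 2.0 / (t * t - 1.0) ** 2 * (t ** 4 / (n - 1) + 1.0)
    return 1.0 / ((b2 - 3.0) / n + spread + 1.0)


SAMPLE_SIZES = (4, 6, 8, 10, 12, 20, 30, 50, 100)
for b2 in (1.0, 2.0, 3.0, 4.0, 5.0):
    row = [round(lower_bound(b2, n), 3) for n in SAMPLE_SIZES]
    print(b2, row)
```

The two figures discussed in the text, 70.8 per cent for $B_2 = 4.0$ and 76.2 per cent for $B_2 = 3.0$ with $n = 10$, correspond to `lower_bound(4.0, 10)` and `lower_bound(3.0, 10)`.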
3. K STRATA PPS SELECTION


3.1. Let there be $k$ finite populations consisting of $N_1, N_2, \ldots, N_k$ units. Let $y_{ij}$ denote variate value $y$ of the $j$-th unit of the $i$-th population. Let us consider a scheme of sampling where $n_i$ units are selected with replacement with probability proportional to some measure of the units from the $i$-th population $(i = 1, 2, \ldots, k)$. Let $p_{ij}$ denote the probability that the $j$-th unit of the $i$-th population will appear in a particular selection while sampling for units from the $i$-th population. Obviously, $\sum_{j=1}^{N_i} p_{ij} = 1$ (for $i = 1, 2, \ldots, k$).


3.2. Let $z_{is}$ be an estimate of population total $M_i$ of the $i$-th population as built up from the $s$-th selected unit, among $n_i$ units selected from the $i$-th population. Obviously

$$z_{is} = \frac{y_{ij}}{p_{ij}}$$

if the $s$-th selected unit happens to be the $j$-th unit of the $i$-th population.


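3.3. Let us define a function $L$ of estimators $z_{is}$ $(s = 1, 2, \ldots, n_i;\ i = 1, 2, \ldots, k)$ and $M_1, M_2, \ldots, M_k$ and $t$ $(t > 1)$ as

$$L = t^2\sum_{i=1}^{k}\frac{\Lambda_i}{n_i} - \left\{\sum_{i=1}^{k}(\bar z_i - M_i)\right\}^2 \tag{3.3.1}$$

where $\Lambda_i = \dfrac{\sum_{s=1}^{n_i}(z_{is} - \bar z_i)^2}{n_i - 1}$ and $\bar z_i = \dfrac{\sum_{s=1}^{n_i} z_{is}}{n_i}$ $(i = 1, 2, \ldots, k)$.

3.4. We have

$$L = \sum_{i=1}^{k}\left\{\frac{t^2\Lambda_i}{n_i} - (\bar z_i - M_i)^2\right\} - \sum_{i \ne j}(\bar z_i - M_i)(\bar z_j - M_j). \tag{3.4.1}$$

Hence

$$E(L) = (t^2 - 1)\sum_{i=1}^{k}\frac{\lambda_i}{n_i} \tag{3.4.2}$$

where $\lambda_i = \sum_{j=1}^{N_i} p_{ij}\left(\dfrac{y_{ij}}{p_{ij}} - M_i\right)^2$ (for $i = 1, 2, \ldots, k$).

3.5. From (3.4.1)

$$L^2 = \sum_i l_i^2 + \sum_{i \ne j} l_i l_j - 2\Big(\sum_i l_i\Big)\sum_{i \ne j}(\bar z_i - M_i)(\bar z_j - M_j) + \Big\{\sum_{i \ne j}(\bar z_i - M_i)(\bar z_j - M_j)\Big\}^2 \tag{3.5.1}$$

where $l_i = \dfrac{t^2\Lambda_i}{n_i} - (\bar z_i - M_i)^2$ (for $i = 1, 2, \ldots, k$).

It can be easily shown

$$E(l_i^2) = \frac{(t^2-1)^2\lambda_i^2}{n_i^2}\left[\frac{B_{2i} - 3}{n_i} + \frac{2}{(t^2-1)^2}\left\{\frac{t^4}{n_i - 1} + 1\right\} + 1\right] \tag{3.5.2}$$

where $B_{2i} = \dfrac{\sum_{j} p_{ij}\left(\dfrac{y_{ij}}{p_{ij}} - M_i\right)^4}{\lambda_i^2}$,

$$E\Big\{\sum_{i \ne j} l_i l_j\Big\} = (t^2-1)^2\sum_{i \ne j}\frac{\lambda_i\lambda_j}{n_i n_j}, \tag{3.5.3}$$

$$E\Big\{\Big(\sum_i l_i\Big)\sum_{i \ne j}(\bar z_i - M_i)(\bar z_j - M_j)\Big\} = 0, \tag{3.5.4}$$

$$E\Big[\Big\{\sum_{i \ne j}(\bar z_i - M_i)(\bar z_j - M_j)\Big\}^2\Big] = 2\sum_{i \ne j}\frac{\lambda_i\lambda_j}{n_i n_j}. \tag{3.5.5}$$

3.6. From (3.5.1)-(3.5.5) it follows

$$E(L^2) = (t^2-1)^2\left[\sum_i\frac{(B_{2i} - 3)\lambda_i^2}{n_i^3} + \Big(\sum_i\frac{\lambda_i}{n_i}\Big)^2\right] + 2\left[t^4\sum_i\frac{\lambda_i^2}{n_i^2(n_i - 1)} + \Big(\sum_i\frac{\lambda_i}{n_i}\Big)^2\right]. \tag{3.6.1}$$

From (1.1.1), (3.3.1), (3.4.2) and (3.6.1),

$$\text{prob.}\{L \ge 0\} \ge \frac{[E(L)]^2}{E(L^2)} = \frac{1}{\dfrac{\sum_i (B_{2i} - 3)\lambda_i^2/n_i^3}{\big(\sum_i \lambda_i/n_i\big)^2} + \dfrac{2}{(t^2-1)^2}\left\{\dfrac{t^4\sum_i \lambda_i^2/\{n_i^2(n_i - 1)\}}{\big(\sum_i \lambda_i/n_i\big)^2} + 1\right\} + 1} \tag{3.6.2}$$

or,

$$\sum_{i=1}^{k}\bar z_i + t\sqrt{\sum_{i=1}^{k}\frac{\Lambda_i}{n_i}} \;\ge\; \sum_{i=1}^{k} M_i \;\ge\; \sum_{i=1}^{k}\bar z_i - t\sqrt{\sum_{i=1}^{k}\frac{\Lambda_i}{n_i}}$$

with probability equal to or greater than the right hand most expression of (3.6.2).

3.7. If all the $n_i$'s are equal to $n$ the expression for the lower bound takes the form

$$\frac{1}{\dfrac{\sum_i (B_{2i} - 3)\lambda_i^2}{n\big(\sum_i\lambda_i\big)^2} + \dfrac{2}{(t^2-1)^2}\left\{\dfrac{t^4}{n-1}\cdot\dfrac{\sum_i\lambda_i^2}{\big(\sum_i\lambda_i\big)^2} + 1\right\} + 1} \tag{3.7.1}$$

which is equal to

$$\frac{1}{\dfrac{\theta(\bar B_2 - 3)}{n} + \dfrac{2}{(t^2-1)^2}\left\{\dfrac{\theta t^4}{n-1} + 1\right\} + 1} \tag{3.7.2}$$

where

$$\bar B_2 = \frac{\sum_i B_{2i}\lambda_i^2}{\sum_i\lambda_i^2}, \qquad \theta = \frac{\sum_i\lambda_i^2}{\big(\sum_i\lambda_i\big)^2} = \frac{1}{k}\left[\{\text{C.V.}(\lambda)\}^2 + 1\right],$$

C.V.$(\lambda)$ being the coefficient of variation of the $\lambda_i$ expressed as a fraction.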
Table 2 below gives numerical values of (3.7.2) for $k = 8$; $n$ = 4, 8, 12, 16; $\bar B_2$ = 2, 3, 4 and C.V.$(\lambda)$ = 0.0, 25.0, 50.0, 75.0, 100.0, 125.0, and 150.0.


TABLE 2. LOWER BOUND OF THE PROBABILITY OF THE INEQUALITY $\sum \bar z_i + t\sqrt{\sum \Lambda_i/n} \ge \sum M_i \ge \sum \bar z_i - t\sqrt{\sum \Lambda_i/n}$
FOR A STRATIFIED DESIGN OF 8 STRATA AND 4, 8, 12 AND 16 UNITS
PER STRATUM FOR DIFFERENT VALUES OF CV(λ)
(values worked out from (3.7.2) taking t = 3)

B2-value                    CV(λ) values as
             0.0    25.0    50.0    75.0   100.0   125.0   150.0
  (0)        (1)     (2)     (3)     (4)     (5)     (6)     (7)

                   number of units per stratum = 4
  2.0       .905    .901    .890    .872    .848    .819    .786
  3.0       .880    .875    .860    .836    .805    .768    .728
  4.0       .856    .850    .832    .803    .766    .724    .678

                   number of units per stratum = 8
  2.0       .943    .941    .936    .928    .917    .903    .887
  3.0       .929    .927    .919    .908    .892    .872    .849
  4.0       .916    .913    .903    .888    .867    .842    .814

                   number of units per stratum = 12
  2.0       .953    .952    .949    .943    .936    .927    .917
  3.0       .943    .942    .937    .929    .918    .905    .889
  4.0       .934    .932    .926    .915    .901    .884    .863

                   number of units per stratum = 16
  2.0       .957    .957    .954    .951    .945    .939    .931
  3.0       .950    .949    .946    .940    .932    .921    .909
  4.0       .943    .942    .937    .929    .918    .905    .889
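The stratified bound with equal numbers of units per stratum can be evaluated in the same way as the single-stratum bound. The sketch below assumes (3.7.2) has the form $1/\{\theta(\bar B_2 - 3)/n + \frac{2}{(t^2-1)^2}(\theta t^4/(n-1) + 1) + 1\}$ with $\theta = (1 + \text{CV}^2)/k$, the coefficient of variation being entered as a fraction (so 25.0 per cent becomes 0.25); this reading reproduces the legible entries of Table 2. The function name is hypothetical.

```python
def stratified_lower_bound(b2_bar, n, k, cv, t=3.0):
    """Lower bound (3.7.2) for k strata with n units each.

    cv is the coefficient of variation of the lambda_i as a fraction.
    """
    theta = (1.0 + cv * cv) / k
    spread = 2.0 / (t * t - 1.0) ** 2 * (theta * t ** 4 / (n - 1) + 1.0)
    return 1.0 / (theta * (b2_bar - 3.0) / n + spread + 1.0)


CV_VALUES = (0.0, 0.25, 0.50, 0.75, 1.00, 1.25, 1.50)
for n in (4, 8, 12, 16):
    for b2_bar in (2.0, 3.0, 4.0):
        row = [round(stratified_lower_bound(b2_bar, n, 8, cv), 3) for cv in CV_VALUES]
        print(n, b2_bar, row)
```

Setting `k = 1` and `cv = 0.0` gives $\theta = 1$ and reduces the expression to the single-stratum bound of section 2.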
From Table 2 it is seen ... and $n$ units per stratum $(n > 4)$, con... the same, the probability (as ...
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3.8. With respect to stratified designs having a constant number of » units per 
stratum thereis another method ofestimation of sampling error and allied confidence interval 
for the population mean (or total) This method at times may be operationally easy and 
thus less costly in large scale tabulations. Hence in this context it may not be out of place 
to discuss that method. Denoting as before by z;,(s = 1,2, ... n; = 1, 2, ..., Е) estimate 
of population total M; as built up from the s-th selected unit from n units selected from the 
i-th population, a set of estimators and a function L, may be defined as 


в; = 23 fu es (8:8) 


E 
D, = ——~ ={й—® AM .. (8.8.2) 
L Ц (a : Mi) 
n 
X a, 
where Gi ж 
n 
It can be easily shown that 
[2 
— (ер W А 
xim O 
and UE -. E" PEN .. (3.83) 
1 0(B,—3) Rr f dr 
n ы ая T j 


where $A_i$, $B_i$ and $\theta$ are as defined earlier in paras 3.4 and 3.7.

Since $\theta < 1$, comparing (3.8.3) with (3.7.2) it is seen that (3.7.2) will always be greater than (3.8.3). Hence, if judged only by this criterion (viz. the expression for the lower bound of the probability of the confidence statement being true), a confidence statement of the form $\bar{a} \pm t\sqrt{\sum_{s=1}^{n}(a_s-\bar{a})^2/\{n(n-1)\}}$ is not to be preferred over the corresponding statement of the first method (para 3.7). If, however, the number of units per stratum is large (something like 16 or more), the second method may be used in preference over the first.
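
The second method of para 3.8 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's tabulation procedure: the function name `collapsed_interval`, the nested-list layout `z[i][s]` (stratum `i`, draw `s`) and the sample figures are hypothetical, and the interval is taken to be $\bar{a} \pm t\sqrt{\sum_s (a_s-\bar{a})^2/\{n(n-1)\}}$ as read from the text.

```python
import math


def collapsed_interval(z, t):
    """Interval for the population total from per-draw sums across strata.

    z[i][s] is the estimate of the i-th stratum total built from the s-th
    draw; every stratum must have the same number n of draws (assumption).
    Returns (a_bar, half_width) for the interval a_bar +/- half_width.
    """
    n = len(z[0])
    if n < 2 or any(len(row) != n for row in z):
        raise ValueError("need the same number n >= 2 of draws in every stratum")
    # a_s: the s-th draw estimates summed over all strata
    a = [sum(row[s] for row in z) for s in range(n)]
    a_bar = sum(a) / n
    ss = sum((x - a_bar) ** 2 for x in a)
    half_width = t * math.sqrt(ss / (n * (n - 1)))
    return a_bar, half_width


# Hypothetical data: 2 strata, 4 draws each
a_bar, hw = collapsed_interval([[10, 12, 14, 16], [20, 18, 22, 20]], t=2.0)
print(a_bar, hw)
```

Only one sum of squares over the $n$ combined values is needed, instead of one per stratum, which is where the operational saving mentioned in the text would come from.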


4. ONE STRATUM PPS SELECTION RATIO ESTIMATE


4.1. One stratum, pps selection, ratio estimate: Let there be a finite population consisting of $N$ units. Let $y_i$, $x_i$ denote respectively variate values of character $y$ and $x$ of the $i$-th unit. Let $n$ units be selected with replacement from the $N$ units with probability proportional to some measure of the units. Let $p_1, p_2, \ldots, p_N$ denote the probabilities for the different units to be selected in a particular draw.


4.2. Let $z_s$ and $w_s$ be respectively estimates of population totals $M_y$ and $M_x$ of character $y$ and $x$ as built up from the $s$-th selected unit. Obviously

$$z_s = \frac{y_i}{p_i}, \qquad w_s = \frac{x_i}{p_i}$$

if the $s$-th selected unit happens to be the $i$-th unit of the population.
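4.3. Let us define a function $L$ of estimators $z_1, z_2, \ldots, z_n$, $w_1, w_2, \ldots, w_n$ and $M_y$ and $M_x$ and $t$ ($t > 1$) as

$$L = \frac{t^2}{n(n-1)}\sum_{s=1}^{n}\{(z_s - Rw_s) - (\bar{z} - R\bar{w})\}^2 - (\bar{z} - R\bar{w})^2$$

$$= \frac{t^2}{n}\{\lambda_z - 2Rr\sqrt{\lambda_z\lambda_w} + R^2\lambda_w\} - (\bar{z} - R\bar{w})^2 \tag{4.3.1}$$

where $R = \dfrac{M_y}{M_x}$, $\bar{z} = \dfrac{1}{n}\sum_{s=1}^{n} z_s$, $\bar{w} = \dfrac{1}{n}\sum_{s=1}^{n} w_s$,

$\lambda_z = \dfrac{\sum_{s=1}^{n}(z_s-\bar{z})^2}{n-1}$, $\lambda_w = \dfrac{\sum_{s=1}^{n}(w_s-\bar{w})^2}{n-1}$, $r = \dfrac{\sum_{s=1}^{n}(z_s-\bar{z})(w_s-\bar{w})}{(n-1)\sqrt{\lambda_z\lambda_w}}$.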


4.4. Treating $z_s - Rw_s$ as a variate and taking mathematical expectations of $L$ and $L^2$ it can be shown that

$$\text{probability}\ \{L > 0\} > P_0 \tag{4.4.1}$$


her 
where PQ— EL w)—3 Ste z aes nj 


(rl )' 


x E (4.4.2) 
where Blz, ш) = мо = Sine ; ; 
5 Pi ( Pi 
1 
Hence we have from (1.1.1), (4.3.1) and (4.4.1),

$$\text{prob.}\Big\{(\bar{z} - R\bar{w})^2 < \frac{t^2}{n}\big(\lambda_z - 2Rr\sqrt{\lambda_z\lambda_w} + R^2\lambda_w\big)\Big\} > P_0. \tag{4.4.3}$$

4.5. Confidence limits for $R$ can be derived from (4.4.3). In brief the procedure may be indicated as under. From (4.4.3) a quadratic equation in $R$ and a quadratic inequality in $R$ can be derived as under:

Quadratic equation:

$$R^2(\bar{w}^2 - p\lambda_w) - 2R(\bar{z}\bar{w} - pr\sqrt{\lambda_z\lambda_w}) + (\bar{z}^2 - p\lambda_z) = 0. \tag{4.5.1}$$

Quadratic inequality:

$$R^2(\bar{w}^2 - p\lambda_w) - 2R(\bar{z}\bar{w} - pr\sqrt{\lambda_z\lambda_w}) + (\bar{z}^2 - p\lambda_z) < 0 \tag{4.5.2}$$

where $p = t^2/n$.

Actual numerical values of $R$ which satisfy (4.5.2), for clarity of exposition, may be considered under three heads:

(a) $\bar{w}^2 - p\lambda_w > 0$; (b) $\bar{w}^2 - p\lambda_w = 0$; (c) $\bar{w}^2 - p\lambda_w < 0$.

Vol. 21] SANKHYA: THE INDIAN JOURNAL OF STATISTICS [Parts 1 & 2
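For (a), roots of equation (4.5.1) are real as the discriminant

$$D = (\bar{z}\bar{w} - pr\sqrt{\lambda_z\lambda_w})^2 - (\bar{z}^2 - p\lambda_z)(\bar{w}^2 - p\lambda_w)$$

$$= \bar{z}^2\bar{w}^2\{p^2(C_{zw}^2 - C_{zz}C_{ww}) + p(C_{ww} + C_{zz} - 2C_{zw})\} > 0$$

where $C_{zz} = \dfrac{\lambda_z}{\bar{z}^2}$, $C_{ww} = \dfrac{\lambda_w}{\bar{w}^2}$, $C_{zw} = \dfrac{r\sqrt{\lambda_z\lambda_w}}{\bar{z}\bar{w}}$.

Hence from (4.5.2) limits for $R$ are

$$R_1 < R < R_2 \tag{4.5.3}$$

where $R_1$ and $R_2$ are the roots of (4.5.1) such that $R_2 > R_1$.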


For (b), limits for $R$ are derivable from the relation

$$\bar{z}^2 - p\lambda_z < 2R(\bar{z}\bar{w} - pr\sqrt{\lambda_z\lambda_w}). \tag{4.5.4}$$

Under (c) there can arise the sub-cases

(c.1) $\bar{z}^2 - p\lambda_z > 0$; (c.2) $\bar{z}^2 - p\lambda_z < 0$.

For (c.1) the discriminant of the equation (4.5.1) is positive and as such, if $R_1$ and $R_2$ be the roots of the equation (4.5.1), limits for $R$ will be

$$-\infty < R < R_1, \qquad R_2 < R < \infty \tag{4.5.5}$$

where $R_1$ and $R_2$ will satisfy the relation

$$R_2 > 0 > R_1. \tag{4.5.6}$$

For (c.2), depending upon the numerical values of $t$ and $r$ for given $\bar{z}^2 - p\lambda_z$ and $\bar{w}^2 - p\lambda_w$, the discriminant will be
either (c.21) positive
or (c.22) negative.
For (c.21) limits for $R$ will be of the nature (4.5.5). For (c.22), as the roots of (4.5.1) will be imaginary, any numerical value of $R$ will satisfy (4.5.2) and as such limits for $R$ derivable from (4.4.3) will be $\infty > R > -\infty$.


4.6. Limits of the nature (4.5.5) are practically useless and, considering the very nature of the limits, the limits derivable from (4.4.3) cannot strictly be called confidence limits. Such limitations, however, apply equally to the bi-variate approach of Fieller as well.
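
The case analysis of paras 4.5 and 4.6 can be illustrated numerically. The sketch below is a hedged reading of the text rather than the author's own procedure: it assumes the quadratic inequality $R^2(\bar{w}^2 - p\lambda_w) - 2R(\bar{z}\bar{w} - pr\sqrt{\lambda_z\lambda_w}) + (\bar{z}^2 - p\lambda_z) < 0$ with $p = t^2/n$, and the function name `ratio_limits`, the case labels and the sample figures are all hypothetical.

```python
import math


def ratio_limits(z, w, t):
    """Solve the quadratic inequality in R for one pps sample (sketch).

    z, w: per-draw estimates of the totals of y and x (z_s = y_i/p_i,
    w_s = x_i/p_i).  Returns (case, pair):
      "interval"   -> R lies between pair[0] and pair[1]        (head a)
      "linear"     -> pair = (c, 2b), meaning c < 2*b*R          (head b)
      "exterior"   -> R < pair[0] or R > pair[1]                 (head c)
      "whole line" -> every real R satisfies the inequality      (head c.22)
    """
    n = len(z)
    if n < 2 or len(w) != n:
        raise ValueError("z and w need the same length n >= 2")
    zb = sum(z) / n
    wb = sum(w) / n
    lam_z = sum((v - zb) ** 2 for v in z) / (n - 1)
    lam_w = sum((v - wb) ** 2 for v in w) / (n - 1)
    # cov equals r * sqrt(lam_z * lam_w) in the notation of the text
    cov = sum((a - zb) * (b - wb) for a, b in zip(z, w)) / (n - 1)
    p = t * t / n
    a = wb * wb - p * lam_w
    b = zb * wb - p * cov
    c = zb * zb - p * lam_z
    disc = b * b - a * c
    if a > 0:
        root = math.sqrt(max(disc, 0.0))  # disc is non-negative when a > 0
        return "interval", ((b - root) / a, (b + root) / a)
    if a == 0:
        return "linear", (c, 2 * b)
    if disc > 0:
        root = math.sqrt(disc)
        lo, hi = sorted(((b - root) / a, (b + root) / a))
        return "exterior", (lo, hi)
    return "whole line", (-math.inf, math.inf)


print(ratio_limits([10, 12, 11, 13], [5, 6, 5, 7], t=2.0))
```

The "exterior" and "whole line" outcomes are the ones para 4.6 calls practically useless: they arise when $\bar{w}$ is small relative to its estimated standard error.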
5. K STRATA PPS SELECTION COMBINED RATIO ESTIMATE


5.1. K strata, pps selection, combined ratio estimate: Let there be $k$ finite populations consisting of $N_1, N_2, \ldots, N_k$ units. Let $y_{ij}$, $x_{ij}$ denote respectively variate values of character $y$ and $x$ of the $j$-th unit of the $i$-th population. Let us consider a scheme of sampling where $n_i$ units are selected with replacement with probability proportional to some measure of the units from the $i$-th population ($i = 1, 2, \ldots, k$). Let $p_{ij}$ denote the probability that the $j$-th unit of the $i$-th population will appear in a particular selection while sampling for units from the $i$-th population. Obviously $\sum_{j=1}^{N_i} p_{ij} = 1$ for $i = 1, 2, \ldots, k$.


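5.2. Let $z_{is}$ and $w_{is}$ be respectively estimates of population totals $M_{iy}$ and $M_{ix}$ of character $y$ and $x$ of the $i$-th population as built up from the $s$-th selected unit among the $n_i$ units selected from the $i$-th population. Obviously

$$z_{is} = \frac{y_{ij}}{p_{ij}}, \qquad w_{is} = \frac{x_{ij}}{p_{ij}}$$

if the $s$-th selected unit happens to be the $j$-th unit of the $i$-th population.


5.3. Let us define a function $L$ of estimators $z_{is}$ and $w_{is}$ ($s = 1, 2, \ldots, n_i$; $i = 1, 2, \ldots, k$) and $M_{iy}$ and $M_{ix}$ ($i = 1, 2, \ldots, k$) and $t$ ($t > 1$) as

$$L = t^2\sum_{i=1}^{k}\frac{\sum_{s=1}^{n_i}(u_{is} - \bar{u}_i)^2}{n_i(n_i-1)} - \Big(\sum_{i=1}^{k}\bar{u}_i\Big)^2 \tag{5.3.1}$$

$$= t^2\sum_{i=1}^{k}\frac{\sum_{s=1}^{n_i}\{(z_{is} - \bar{z}_i) - R(w_{is} - \bar{w}_i)\}^2}{n_i(n_i-1)} - \Big\{\sum_{i=1}^{k}(\bar{z}_i - R\bar{w}_i)\Big\}^2 \tag{5.3.2}$$

$$= t^2\sum_{i=1}^{k}\frac{1}{n_i}\{\lambda_{iz} - 2Rr_i\sqrt{\lambda_{iz}\lambda_{iw}} + R^2\lambda_{iw}\} - \Big\{\sum_{i=1}^{k}\bar{z}_i - R\sum_{i=1}^{k}\bar{w}_i\Big\}^2 \tag{5.3.3}$$

where $u_{is} = z_{is} - Rw_{is}$, $\bar{u}_i = \dfrac{1}{n_i}\sum_{s=1}^{n_i} u_{is}$, $R = \dfrac{\sum_{i=1}^{k} M_{iy}}{\sum_{i=1}^{k} M_{ix}}$,

$\bar{z}_i = \dfrac{1}{n_i}\sum_{s=1}^{n_i} z_{is}$, $\bar{w}_i = \dfrac{1}{n_i}\sum_{s=1}^{n_i} w_{is}$,

$\lambda_{iz} = \dfrac{\sum_{s=1}^{n_i}(z_{is} - \bar{z}_i)^2}{n_i - 1}$, $\lambda_{iw} = \dfrac{\sum_{s=1}^{n_i}(w_{is} - \bar{w}_i)^2}{n_i - 1}$, $r_i = \dfrac{\sum_{s=1}^{n_i}(z_{is} - \bar{z}_i)(w_{is} - \bar{w}_i)}{(n_i - 1)\sqrt{\lambda_{iz}\lambda_{iw}}}$.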
5.4. It can be easily shown that


* E(D) = (8—1) z хә. с | (5.4.1) 
k 
; Apu) Бы—8 
and — БИлу=(#—1у{ > a 2 | | TE 
; : A i)? è 
| px 
k 


жа 1 (5.4.2) 
Pg ( Alu) ) 
т ni 
Ni 
where Аи) = S py ai EO) | ... (5.4.3) 
j=1 Pij 
Ni 
. [yg—R9u im. = ^ * 
s 2.0 [8554 -My вм. | | 
Bim: ' Е - es (64.4) 
Quy | | 
Hence from (1.1.1), (5.3.3), (5.4.1) and (5.4.2),
prob. $\{L > 0\} > P_1$,
where Р,= oe = Е 1 250 
2(u; ja 
У 0 (ваз) „ул 
— ку Е 1+ 2 nz (n,—1) 1 
A(u;)\2 P=) LAS 
) Au) 2 
т (2 52.) 
1 ъ 
(5.4.5) 


5.5. We have from (5.3.3) and (5.4.5)

$$\Big(\sum_{i=1}^{k}\bar{z}_i - R\sum_{i=1}^{k}\bar{w}_i\Big)^2 < t^2\sum_{i=1}^{k}\frac{1}{n_i}\{\lambda_{iz} - 2Rr_i\sqrt{\lambda_{iz}\lambda_{iw}} + R^2\lambda_{iw}\} \tag{5.5.1}$$

with probability greater than (or equal to) $P_1$, where $P_1$ is given by (5.4.5). From (5.5.1) limits for $R$ can be derived on the same lines as discussed earlier for the case of a single stratum.
6. EXTENSION TO MULTISTAGE DESIGNS


6.1. One stratum, two stage design, pps selection: Let there be a finite population consisting of $k$ first stage units, where the $i$-th unit contains $N_i$ second stage units. Let $y_{ij}$ be the value $y$ of the $j$-th second stage unit of the $i$-th first stage unit ($j = 1, 2, \ldots, N_i$; $i = 1, 2, \ldots, k$). Let us consider a two stage sampling scheme where $n$ first stage units are selected with replacement from the $k$ first stage units with probability proportional to some measure of the units. Let $p_1, p_2, \ldots, p_k$ denote the probabilities for the different first stage units to be selected in a particular draw. Within each selected first stage unit let us select $n_1$ or $n_2$ or ... $n_k$ second stage units (according as the selected first stage unit happens to be the 1st or 2nd ... or $k$-th first stage unit) with replacement with probability proportional to some measure of the second stage units. Let $p_{ij}$ denote the probability that the $j$-th second stage unit (of the $i$-th first stage unit) having variate value $y_{ij}$ will appear in a particular selection while sampling for second stage units after the $i$-th first stage unit has been selected ($j = 1, 2, \ldots, N_i$; $i = 1, 2, \ldots, k$). Obviously $\sum_{j=1}^{N_i} p_{ij} = 1$, for $i = 1, 2, \ldots, k$.


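6.2. Let $z_s$ be an estimate of population total $M$ as built up from the $s$-th selected first stage unit. Obviously,

$$z_s = \frac{1}{p_i}\cdot\frac{1}{n_i}\sum_{q=1}^{n_i}\frac{y_{i(q)}}{p_{i(q)}}$$

if the $s$-th selected first stage unit happens to be the $i$-th first stage unit and $y_{i(1)}, y_{i(2)}, \ldots, y_{i(n_i)}$ happen to be the $y$-values of the $n_i$ selected second stage units, $p_{i(q)}$ being the selection probability of the $q$-th of these.


6.3. Let us define a function $L$ of estimators $z_1, z_2, \ldots, z_n$, $M$ and $t$ ($t > 1$) as

$$L = \frac{t^2\sum_{s=1}^{n}(z_s - \bar{z})^2}{n(n-1)} - (\bar{z} - M)^2 \tag{6.3.1}$$

where $\bar{z} = \dfrac{1}{n}\sum_{s=1}^{n} z_s$.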

6.4. We have from (1.1.1) 


@— 
f Sea 2M g— Em | 
prob. i aes a(n—1) 
(6.4.1) 
ate") 
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5 Р = А ил(2) ТИРИ 
where $\beta_2(z)$ for the scheme of sampling considered is given as $\dfrac{\mu_4(z)}{\{\mu_2(z)\}^2}$, where


на Xo (ox a X non] 
мы = >, n( M) л> s (27) 
HS mange жИН) PCT 
H2 as (90-м) 22m (№ cin) 


№; 
when $M_i$ stands for the total of the $y$-values of the $i$-th first stage unit, i.e. $M_i = \sum_{j=1}^{N_i} y_{ij}$ ($i = 1, 2, \ldots, k$).
A PILOT HEALTH SURVEY IN WEST BENGAL—1955


SUMMARY. Even in those countries which have shown considerable progress in medical and public health fields, morbidity statistics have always remained somewhat inadequate and defective. In India a large mass of the population, particularly in the rural areas, does not avail of any medical care. It will, therefore, be too much to expect the data obtained through hospitals and other official agencies to provide adequate and reliable statistics of the morbidity pattern in this country. For obtaining a comprehensive picture of the morbidity conditions one has, therefore, to depend upon other sources, particularly sample surveys. The main object of this study was to evolve suitable procedures for the collection of morbidity and medical care statistics by sample surveys.


CHAPTER 1


INTRODUCTION


1.1. The promulgation of the Births, Deaths and Marriages Act in 1886 marked the beginning of registration of vital events on a voluntary basis throughout India. Since then, some of the States have passed Special Acts for the compulsory registration of births and deaths.
1.2. Although three-fourths of a century has elapsed since the enforcement of registration in the country, the machinery is still in its primitive stage with no appreciable improvement and is almost breaking down for apathy and lack of administrative vigilance. The defects inherent in the system still continue without being rectified. Although the reporting of vital events is a primary duty of the people, they are either ignorant of it or indifferent to it. Moreover, there is practically very little use of birth and death certificates at the present time. The chowkidar or the village headman, whose responsibility it is to report the vital events occurring in rural areas, is already overburdened by his revenue and police assignments, with the result that the registration of vital events in rural areas does not get adequate care or attention. This has led to a gross under-registration of births and deaths which may be of the order of 50 per cent. In the urban areas also proper attention has not been paid to improving the system of registration.

1.3. Further, so far as death registration is concerned, the reporting of the cause of death by the chowkidar, who is generally illiterate or semi-literate, is anything but satisfactory. Apart from a few well-known and easily recognisable diseases like small-pox and plague, the returns obtained are practically of no use for the purpose of health planning or research. The available vital statistics, therefore, are wholly inadequate and unreliable for carrying out any scientific research or effective health planning.
1.4. The deficiency in the existing morbidity statistics is still more glaring. Even in statistically advanced countries, the recording of such events could not be considered as absolutely [illegible]. Nevertheless, in those countries a ceaseless effort goes on to perfect the machinery responsible for the collection of such statistics by increasing the scope of notifiable diseases to include a larger number of diseases and by supplementing the available morbidity data by highly specialised surveys. However, a change in the official attitude has been discernible in this country for a decade or so and there has been an increasing realisation of the importance and value of vital and health statistics in the fields of health planning and research. The Government of India, realising the magnitude of the health problems,
set up a committee in 1943 known as the Health Survey and Development Committee under
the chairmanship of Sir Joseph Bhore to review the then existing health conditions and the 
status of vital statistics in the country, and to suggest ways and means for their improve- 
ment. The recommendations contained in the report which was the result of a painstaking 
and extensive fact-finding study, though accepted in principle, have not, however, been 
implemented in full. Nevertheless, the Planning Commission of the Government of India 
has considered some of the aspects implied in the recommendations and introduced them 
with suitable modifications in the country's development programmes. 


1.5. A searching analysis of the available vital statistics has been made by the Committee. In the chapter on ‘Vital Statistics’, it says, “Preventive and curative work can be organized on a sound basis only on accurate knowledge regarding the diseases and disabilities prevailing in any area..... The organization of morbidity statistics for the community presents a difficult problem even in countries in which the development of health services has advanced much more than India, and figures for deaths in view of their greater completeness are generally utilized to a greater extent for the study of health problems, even though the latter constitute more satisfactory material for such study. It is only when adequate medical services covering the whole population and offering protection to all irrespective of their ability to pay for such protection, becomes established and operates over a reasonable period of time, that morbidity statistics of the requisite quality and quantity will develop.”


1.6. The available information on morbidity is chiefly confined to those diseases which are made notifiable to the health authorities. This fact, together with the inadequate medical care available to the population, greatly restricts the extent and accuracy of such returns. The situation has been adequately summed up in the Report of the Health Survey and Development Committee (loc. cit.) which reads, “There are considerable variations in the number of communicable diseases which are notifiable in the different provinces . . . there do not exist, even in the large cities, adequate facilities for ensuring that some of these diseases, for example, tuberculosis, will be notified in sufficient numbers to ensure that a substantial proportion of the actual occurrences will be brought on record.”


1.7. The absence of reliable and adequate national statistics of the incidence of diseases and injuries is being keenly felt by the health authorities. Perfection in this direction can be achieved only in the long run. For some years to come one will have to depend on morbidity and mortality statistics collected from ‘ad hoc’ surveys, either on a sample basis or by a complete investigation in selected areas, to ensure reliability and adequacy of the statistical material.


1.8. A number of attempts have been made by public health organizations in recent years to assess the prevalence of certain chronic diseases such as tuberculosis, leprosy etc. Such surveys require trained medical personnel and elaborate laboratory facilities, and in a country like India with limited resources the introduction of such investigational methods on a national scale would be beset with practical difficulties. Further, these surveys cover only the so-called chronic diseases and leave out entirely the acute diseases, which in India form the bulk of the total morbidity.
1.9. Apart from the above few studies of morbidity from specific diseases, no comprehensive survey concerning morbidity and medical care on a national basis has hitherto been attempted. The only general health survey ever done in this country was the Singur Health Survey in 1944, which was confined to a small compact rural area in West Bengal. The lack of enthusiasm on the part of the government to initiate comprehensive health surveys may be attributed to the meagre financial resources and to the shortage of trained personnel for carrying out such extensive undertakings. Even if these are forthcoming, a health survey by its very nature will still prove to be a difficult proposition due to the lack of knowledge of a suitable procedure for the collection of health statistics in the peculiar conditions obtaining in the country. Before launching full-fledged morbidity surveys, therefore, it is desirable to initiate small pilot studies, and the experience gained by such studies may be utilized with advantage for planning future health surveys more effectively. With this end in view, the West Bengal Health Survey was sponsored by the Indian Statistical Institute in 1955, in conjunction with an enquiry into the employment conditions of the rural and urban populations, purely as a pilot study to evolve suitable methodology for the collection of morbidity and medical care statistics.
1.10. In recent years, however, the National Sample Survey (NSS), the only organization in India collecting socio-economic statistics on a national scale, has included within the scope of its enquiry a few questions regarding the occurrence of certain vital events like births, deaths, marriages and diseases. The information, though useful for a broad analysis, is not adequate for a critical evaluation of the health conditions of the people. With its unique position in the country, the NSS is, perhaps, the most competent single survey organization which can undertake health surveys on a nation-wide basis. Advantage should, therefore, be taken of this elaborate machinery for conducting health surveys in future for yielding optimum and quick results.

CHAPTER 2


SUMMARY FINDINGS

2.1. The West Bengal Health Survey was initiated by the Indian Statistical Institute in 1955 as a pilot study to evolve a suitable methodology for the collection of morbidity and medical care statistics. A total of 1172 rural households distributed over 72 sample villages and 566 urban households distributed over 5 sample towns or cities were surveyed. The households selected in the sample were kept under observation for a period of 3 months by periodical visits. The information collected related to


(i) demographic particulars of the members of the household,
(ii) housing and sanitary conditions,
(iii) composition of the diet,
(iv) morbidity and medical care,
(v) specific details of current pregnancy terminations,
(vi) history of past pregnancy terminations.


2.2. Morbidity rates. All illnesses whose duration exceeded 3 months were classified as chronic and those that lasted for periods shorter than 3 months were classified as acute. For the former, the morbidity rate has been expressed by its rate of
prevalence, that is, the number of cases per 1000 population at an instant of time, and for 
the latter the morbidity has been expressed by its rate of incidence, that is, the number 
of new cases arising in a year per 1000 population. The total prevalence rate of chronic 
diseases observed among the urban population was higher than that observed among the 
rural population, the rates being 35 and 20 for the urban and rural populations respectively. 
In respect of acute diseases also the estimated incidence rate of the urban population was 
higher than that for the rural population, being 423 and 328 respectively. 
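
The two rates defined in para 2.2 differ only in what is counted and over what span. The sketch below illustrates the arithmetic; it is not the survey's estimation procedure. The function names, the sample counts and, in particular, the simple scaling of a 3-month count to a yearly figure are assumptions made here for illustration.

```python
def prevalence_per_1000(cases_at_instant, population):
    """Cases existing at one instant of time, per 1000 population."""
    return 1000.0 * cases_at_instant / population


def annual_incidence_per_1000(new_cases, population, months_observed):
    """New cases per 1000 population per year.

    Assumption: new cases arise evenly through the year, so the observed
    count is simply scaled by 12 / months_observed.
    """
    return 1000.0 * new_cases / population * (12.0 / months_observed)


# Hypothetical counts for a population of 1000 observed for 3 months
print(prevalence_per_1000(20, 1000))
print(annual_incidence_per_1000(82, 1000, 3))
```

Seasonal variation in acute illness would make the even-scaling assumption questionable for an observation period as short as three months.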


2.3. Taking all illnesses and injuries together, a total of 100 cases per 1000 rural population and 150 cases per 1000 urban population were observed during the 3-month period of this survey. In a similar survey conducted in the U.K. by the Ministry of Health (1946), a total of 5518 cases were observed among a group of 7000 persons during a three-month period, that is, about 790 cases per 1000 population. The contrast between the estimates for West Bengal and the U.K. is indeed very striking; taken at face value, the results would suggest that the people of the U.K. are less healthy than those of West Bengal, which contradicts the prevailing notion about the relative levels of health in these two communities.


2.4. Recall lapse could not possibly explain the huge difference between these two reported rates, as we have kept a follow-up record for each family over the three-month period by visiting each household 4 times at intervals of 3 weeks. The reason must be largely ascribable to the low level of health consciousness of the West Bengal population, which is essentially the result of their low level of living.


2.5. As there is no well-defined line of demarcation between the state of health 
and that of disease of an individual, it is likely that the morbidity returns obtained in an 
enquiry of this type will be influenced appreciably by the level of health consciousness of 
the community. The lower the level of health consciousness, the greater is the chance of 
overlooking minor ailments and of reporting only those illnesses that cause severe pain or
disability.

2.6. In Tables 5.10 and 5.11 are shown the incidence and prevalence rates for each group of diseases. Among acute diseases, those of the respiratory system alone accounted for about half of the morbidity among the rural and urban populations. Dysentery, diarrhoea and other diseases of the digestive system occupy the second place in order of importance in the morbidity pattern of West Bengal.


2.7. Among chronic diseases taken individually, pulmonary tuberculosis is perhaps the most important disease, particularly in the urban population. The prevalence rates for pulmonary tuberculosis among the urban and rural populations were 3.77 and 1.68 per 1000 persons respectively.


2.8. When the total morbidity for all types of illnesses was estimated, the only important source of error arose from non-reporting of minor illnesses by the respondents. However, when one attempts to estimate the morbidity rates for individual groups of diseases, error due to misclassification by cause of disease is bound to arise. The prevailing opinion among public health workers regarding the error of misclassification is that it is likely to be enhanced considerably if the investigation is conducted by non-medical personnel. As the West Bengal Health Survey was conducted by non-medical investigators, it was considered desirable to assess the validity of the morbidity rates for specific groups of diseases. For this purpose, about 400 addresses of patients attending the O.P. of the R. G. Kar Medical College Hospitals were collected, together with their diagnostic reports. The households to which those patients belonged were apportioned between two teams of investigators, one medical and the other non-medical, and the reports obtained by these two teams were compared with the corresponding reports obtained from the hospital register. The results of such comparison are shown in Tables 5.1 to 5.7 of this report.
Two kinds of discrepancies in tallying the investigators’ reports with the corresponding entries in the hospital register were observed, one arising out of failure to report the illness of the afflicted individual and the other arising out of misclassification of the cause of the disease. In respect of the first, the medical investigators’ perf…
2.13. Of the three important systems of medicine practised in this country, namely,
Allopath, Homeopath and Ayurved or Unani (Indian), the one most frequently resorted 
to is the allopathic system, 41 per cent of the cases among the rural and 51 per cent of the 
cases among the urban population availing of this system. 


2.14. Those who did not avail of any sort of medical treatment were asked to 
state reasons for not availing medical care. 41 per cent of rural and urban patients stated 
that ‘sickness was not serious’ and 33 per cent of the rural and 6 per cent of the urban 
patients stated that ‘medical care was too expensive’. 


2.15. As regards maternity care, the present position seems to be very unsatisfac- 
tory. About 24 per cent of the rural deliveries and 20 per cent of the urban deliveries 
were without any sort of professional attendance. Those attended by untrained nurses 
(dhais) comprised 76 per cent of the rural deliveries and 25 per cent of the urban 
deliveries. 


2.16. Infant mortality. For the purpose of relating the general health of the population to various factors such as nutrition, housing and sanitary condition etc., it was thought advisable to use one single index such as the infant mortality rate. An analysis of the infant mortality rate by level of nutrition and sanitation of the dwelling place shows a high association of these factors with the infant mortality rate. It was observed that among the class of population at a moderately good level of nutrition, the estimated infant mortality rate was 116 for the rural and 111 for the urban population, whereas among the under-nourished class, the estimated infant mortality rates were 171 for the rural and 150 for the urban population. Similarly, it was observed that among those in the urban sector whose housing condition was somewhat satisfactory the infant mortality rate was 85, as against 149 recorded among those whose housing condition was definitely unsatisfactory. Infant mortality rates were … economic characteristics of the fathers of the infants and here also it was observed that the association of infant mortality rates to these factors was very striking. Though these factors may not directly determine the health conditions of the population, they serve as useful criteria in stratifying the population for obt… survey of this kind.
2.18. The entire rural and urban samples in this survey were divided into two
sets of interpenetrating sub-samples and some of the important estimates given in the body 
of the report have been obtained sub-samplewise and shown in Appendix 1. 
3.5. The exact details of information falling under the above 6 main heads can be seen from the schedule, a facsimile of which is reproduced in Appendix 2.


3.6. Method of investigation. In general, there are two lines of approach to the 
study of illness in a population. 


(i) The single visit survey by which records of illness for a sample population 
on the day of visit or for a specified period of time previous to the visit are 
collected and 


(ii) keeping a sample population under continued observation over a period of time
and recording the illnesses happening during that time.


3.7. Though both the methods yield valuable results, it has to be borne in mind that, due to limitations of memory of the informants, some of the illnesses, particularly those of the respiratory and digestive systems which are of a minor nature and cause little or no disability, are largely missed unless the period referred to is a short one. In any survey where probing into past events is a necessary feature of the investigational procedure, due safeguards have to be taken to eliminate or reduce to the minimum the effects of memory lapse. This is an intricate problem, as too short a period will considerably reduce the time coverage and too long a period will obviously result in serious recall lapse. Even in countries where the public health administration has attained a high level of efficiency, the element of memory lapse has added considerably to the difficulties of carrying out morbidity surveys. The investigational procedure of this survey was designed in such a fashion that it was possible for the investigators to visit each household selected in this survey 4 times during the period of the investigation. The duration of observation extended over a period of 3 months commencing 14 days prior to the first visit and terminating with the date of the last visit. As the period between successive visits was usually 3 to 4 weeks, a continuous record of vital events could be collected by this survey, which may reasonably be assumed to be free from any major source of error arising out of recall lapse while at the same time providing a substantial coverage over time.
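
The observation period described in para 3.7 can be worked out from the visit dates alone. The following is a small sketch of that calculation; the function name `observation_window`, the default of 14 lead days taken from the text, and the four visit dates are hypothetical and serve only to illustrate the rule "from 14 days before the first visit to the date of the last visit".

```python
from datetime import date, timedelta


def observation_window(visit_dates, lead_days=14):
    """Return (start, end, length_in_days) of the period under observation.

    The window opens lead_days before the first visit and closes on the
    date of the last visit; both end dates are counted in the length.
    """
    visits = sorted(visit_dates)
    start = visits[0] - timedelta(days=lead_days)
    end = visits[-1]
    return start, end, (end - start).days + 1


# Hypothetical schedule: four visits roughly 3 to 4 weeks apart
visits = [date(1955, 1, 15), date(1955, 2, 8), date(1955, 3, 4), date(1955, 3, 28)]
print(observation_window(visits))
```

With visits spaced 3 to 4 weeks apart, four visits plus the 14-day lead give a window of roughly three months, in line with the duration stated in the text.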


3.8. The schedule has been divided into 9 blocks and each block relates to a specific 
aspect of the survey. In the first visit all the blocks except block 9 were to be filled in. 
In the second and third visits only the information regarding the illnesses or injuries and 
medical care (block 7) and current terminations (block 8) were to be entered. In the 


fourth visit, which was the final one, besides blocks 7 and 8, block 9 giving particulars of
past terminations was to be filled in.


3.9. To avoid possible errors arising out of vague definitions of terms it is essential, at the outset of any survey, to give precise definitions to certain categories included in the questionnaire. It may even be desirable at times to modify the usual definitions of terms to suit the specific object in view, or to conform to the investigational procedure. For instance, the usual definition of a household as denoting a group of persons taking principal meals from a common kitchen for at least 16 days during the month preceding the date of enquiry, which is generally applied to socio-economic studies, had to be slightly extended for the purpose of this survey to include within its scope three more categories, namely, (i) all children born during the 14 days prior to the date of visit to members of the household, (ii) all persons dead during the 14 days prior to the date of visit who, if alive, would have been classed as members of the household and (iii) all persons admitted into hospitals …
classified as acute or chronic with the help of expert medical advice. Each of these two classes was further subdivided into 3 groups according to the nature of disability caused, namely, (i) non-disabling, (ii) disabling but not causing confinement to bed or hospital and (iii) causing confinement to bed or hospital. In all non-bed cases, disability to perform normal functions such as working, attending school etc., due to sickness was a readily recognizable fact, wherever it existed.


3.17. However, in cases of illness among those persons who have no such functions to perform, as for instance infants and aged persons, it is rather difficult to distinguish between the state of disability and non-disability. In such cases, the sickness was considered as disabling if the affected persons availed of either medical treatment or special diet. Here also, as before, the diseases were classified into 6 categories according to the nature of disability caused and the appropriate codes were entered in column 5.


3.18. The date of onset of a disabling disease was reckoned from the date on which the disability actually started and was entered in column 6. If the disease was a non-disabling one, no information on date of onset was sought. The date of recovery was the date on which disability ceased, and if the patient recovered within the reference period of a particular visit, the date of recovery with the prefix ‘R’ (for example, R—4th April) was entered in this column. If the illness resulted in death, the date of death with the prefix ‘D’ (for example, D—4th April) was entered in this column. If illness was prevailing on the date of visit, merely the letter ‘P’ was entered in this column, indicating that a follow-up was needed in the next visit.


3.19. The type of medical attendance availed for the illness under consideration
fell under the categories of ‘allopath’, ‘homeopath’, ‘ayurved’, ‘unani’ and ‘quack’. Appropriate
codes were entered in column 9. In view of the fact that a significant proportion of
the population, particularly in the lower social and economic levels, still seek or are forced
by circumstances to seek the help of quacks, it was thought desirable to add this category
also to the various types of medical attendance, though in any scientific discussion such cases
are to be taken as equivalent to no medical care availed. In those cases where more than
one type of medical help was availed, multiple codes were to be entered and those cases which
were not medically attended were to be shown as having not received medical attendance.
Obviously, for those belonging to the latter category, the question of attendance would not arise.


3.20. Details of expenditure on medical care incurred for each case during each
reference period were to be entered in columns 10 to 12. Physician's fees and cost of medicines
were to be entered in columns 10 and 11 respectively while column 12 was meant for
writing down such expenses incurred towards hospital rent, nursing, transport, etc. If the
amount expended by the household was to cover more than one case of illness, then it was
necessary to split the total amount and to allocate to each case its share.


3.21. It is generally known that a sizeable fraction of illnesses does not receive any
medical attention at all. Such being the case, it was thought useful to elicit information as
to why medical care was not sought. The probable answers such as ‘hospital or physician
not available’, ‘too expensive’, ‘no faith in treatment’, ‘sickness not serious’, were codified
and the appropriate code(s) was entered in column 13. If there were reasons other than those
specified above, they were all lumped together and put under the general head ‘other reasons’
and given a separate code.


3.22. Name of the disease as stated by the informant and entered in column 4
together with the available details regarding the signs and symptoms of the disease given
by the informant of his or her own accord formed the principal criteria for classification of
diseases by causes. It was felt at the outset of the survey that without sufficient confirmatory
evidence from factual experience it would be difficult to assess the accuracy or validity of the 
information on diseases thus collected. Hence, an attempt was made in this investigation 
itself to collect relevant material for a validity study. For this purpose, investigators were 
instructed to pick out cases which were medically attended and to enter the diagnostic report 
of the attending physician wherever such reports were accessible. Such a check was expected 
to give the necessary supporting evidence for the validity of the returns obtained in the 
survey. 

3.23. Only such cases of child-birth occurring to members of the selected households
during the four reference periods were to be entered in block 8. As the number of households
covered by the sample was of the order of 1750, it was not expected that more than 80
births would be recorded during the course of the observational period of 3 months. Hence,
information on only a limited number of items pertaining to post-natal care has been
collected.


[...] woman and her husband (living or dead). This block was filled up only in the last visit to
the household since it was considered that in making such enquiries about the past histories 
of the woman a more intimate acquaintance with the household was desirable to enlist the 
full co-operation of the household. The information on the age of the mother at successive 
terminations and the result of each termination were entered in this block. In the event 
of the last termination of any woman taking place within one year prior to the visit, the month 
of birth and the result of the termination were to be recorded in columns 32 and 33 respec- 
tively. This was necessary because for such children who have not completed one year of 
age the period of exposure had to be separately estimated. Also in respect of such a
termination, information on ante-natal care was to be collected and entered in column 34. 
A two-digit composite code, the left-hand digit indicating the type of attendance and the 
right-hand digit indicating the number of such attendances was to be used. 
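

The two-digit composite code described above can be illustrated with a short sketch. The
function names and the example values are hypothetical; the actual code list of the schedule
is not reproduced in this section, and the sketch assumes that both the type of attendance
and the number of attendances fit in a single digit.

```python
def encode_antenatal(attendance_type: int, n_attendances: int) -> int:
    """Pack type of attendance (left digit) and number of attendances (right digit)."""
    if not 0 <= attendance_type <= 9:
        raise ValueError("attendance type must be a single digit")
    if not 0 <= n_attendances <= 9:
        raise ValueError("number of attendances must be a single digit")
    return attendance_type * 10 + n_attendances


def decode_antenatal(code: int) -> tuple[int, int]:
    """Split a composite code back into (attendance type, number of attendances)."""
    if not 0 <= code <= 99:
        raise ValueError("composite code must have at most two digits")
    return divmod(code, 10)


# Hypothetical entry: attendance type 2, attended 3 times.
example_code = encode_antenatal(2, 3)
```

Keeping both items in one column saves space on the schedule, at the cost of capping each
item at nine under the single-digit assumption made here.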


CHAPTER 4 
INFANT MORTALITY 


4.1. One of the vital rates with a very wide range of applicability in the field 
of public health is the infant mortality rate. This is expressed as the number of deaths 
under one year per thousand live births. Apart from its practical utility in maternal and 
child health studies, its value as a general index of the health of a population group is in 
no way inferior to such mortality rates as standardized or life table death rates. Its high 
sensitivity to the general living conditions of the population to which it relates makes it 
an immensely valuable measure for comparison of health conditions of different population 
groups. 
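

The definition above reduces to a one-line calculation. A minimal sketch follows; the function
name and the counts in the usage line are illustrative assumptions and are not taken from the
survey schedule.

```python
def infant_mortality_rate(deaths_under_one_year: int, live_births: int) -> float:
    """Deaths under one year of age per thousand live births."""
    if live_births <= 0:
        raise ValueError("live_births must be positive")
    if not 0 <= deaths_under_one_year <= live_births:
        raise ValueError("deaths must lie between 0 and live_births")
    return 1000.0 * deaths_under_one_year / live_births


# Hypothetical counts: 122 infant deaths among 500 live births.
example_rate = infant_mortality_rate(122, 500)
```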


4.2. For the purpose of this survey, a household which has been selected as the
ultimate sample unit is defined as a group of persons living together for a period of one
month previous to the date of visit. Obviously, this is too short a period for studies of vital
events like deaths, etc. This definition necessarily excludes from its purview the consideration
of vital events relating to persons who were members of the household earlier but not
at the time of survey. But the scope of the definition of the household had to be restrained
in the above manner to obviate certain practical complications that might arise due to the
mobility of the members of households. For example, it is not unlikely that with the death
of the principal earner of a household his dependents are eventually absorbed as members
of other households leading to a dissolution of the original household in which the death
occurred. Such circumstances will naturally lead to gross under-estimation of deaths since
the sample frame excludes households in which such events occurred. In this survey,
however, a continuous record of births, deaths and illnesses occurring to the members of
the selected households during a span of three months could be maintained since the survey
was conducted by four visits at specified intervals. Even a period of three months, it is
to be admitted, falls short of the requirements for obtaining sufficiently reliable estimates
of the age-specific mortality rates which enter into the computation of standardized or
life-table death rates.


4.3. Since married women belonging to the selected households formed a representative
sample of all living married women in the population, the infant mortality rates relating
to the live births which occurred to them during the past one or even two years could be
considered as adequately depicting the infant mortality conditions prevailing during the
period. This is true because all children born during the period except those born to women
dying during the period will be represented in the sample whether or not such infants could
be considered as members of the household according to the definition. The fraction of
women dying during the period of one or even two years being negligible, the infant mortality
rate calculated on the basis of the surviving women is not likely to be significantly different
from the one calculated by inclusion of the dying women.


4.4. The above considerations go to show that the infant mortality rate, apart
from its usefulness as a general mortality index, has a definite advantage over other general
mortality rates with regard to the statistical validity when the study is based on sample
survey data. For this reason, the infant mortality rate has been utilized in this study as
a general mortality index for the comparison of different social and economic classes. All
live births occurring to each married woman of the household, the order of such births
together with information on survival at the end of one year after birth can be obtained
from block 9 of the schedule. For the computation of infant mortality rates of different
population groups, a short reference period of two or even five years would have been
desirable because such a period has two advantages, namely, (i) infant mortality rate based
on recent live births would be more appropriate to depict the recent health conditions of
the population groups and (ii) the infant mortality rate would be statistically more valid
as it would be relatively free from recall lapse. Due to the limitations in the present data,
the choice of such a short reference period becomes almost impracticable and a much wider
basis to include all live births that occurred upto the date of survey to each ever married
woman was resorted to. The validity of the comparison on the basis of infant mortality
rates thus obtained may be disputed as it was pointed out in the National Sample Survey
Report No. 7 that the infant mortality rates relating to the pre-1930 marriage cohorts were
inordinately low compared to the official rates for corresponding periods. As the NSS data
were collected in the year 1952, it might be expected that about 40 per cent of the pre-1930
marriage cohorts might have died earlier to the date of survey with the result that this
group got automatically excluded from the analysis. Since the cohorts thus excluded were
likely to belong to the high mortality group, the infant mortality rate estimated from the
surviving group might have been probably biased towards a lower value. Under the
circumstances, it is difficult to attribute the entire difference between the NSS rate and
the official rate to recall lapse.


4.5. In the following analysis, therefore, we have, in the first instance, limited our
study exclusively to the group of ever-married women with at least five terminations. The
live-births occurring to such women were arranged by parity and the infant mortality rate
for each successive parity was estimated. The rates for the rural and urban samples are
shown in Table 4.1.


4.6. The results given in the above table clearly show, as should be expected, that
the infant mortality rates relating to the initial parities, particularly to the first two, are
substantially higher than those recorded for subsequent parities. If recall lapse did really
affect the infant mortality rates in the distant past, then the estimated rates for the initial
two parities could not have exceeded those for subsequent parities both among the rural
and urban populations to the observed extent. Moreover, since the infant mortality rates
for the five parities are based on births relating to the same group of women, and if
it can be assumed that the average time interval between the first and fifth parities is just
15 years, there seems to be not much ground to suspect a substantial reduction in the infant


TABLE 4.1. MORTALITY RATES FOR INFANTS BORN TO
WOMEN HAVING 5 OR MORE TERMINATIONS


   order of birth          rural               urban
        (1)                 (2)                 (3)
1. 1st                 244.00 (500)¹       205.13 (156)
2. 2nd                 228.00 (500)        202.53 (158)
3. 3rd                 198.02 (505)        121.02 (157)
4. 4th                 157.37 (502)        132.08 (159)
5. 5th                 134.92 (504)         87.50 (160)
6. 6th                 118.13 (364)        123.81 (105)
7. 7th                 101.21 (247)         51.28 (78)
8. 8th & above          96.67 (300)        148.15 (108)
9. all orders          169.49 (3422)       139.69 (1081)


¹ Figures in parentheses refer to the numbers of live-births on which
the infant mortality rates are based.


mortality rate due to recall lapse. In parities higher than the fifth, there would be a
successive reduction from the initial set of women as some of them with fewer terminations
are likely to drop out and as such their rates are not strictly comparable with the rates for
earlier parities.


4.7. As the parity advances, the estimated infant mortality rates also correspond
more and more closely to recent events and if recall lapse were significant there should have
been a tendency for the estimates to rise with advancing parities. The estimates shown in
Table 4.1, however, do not give any evidence of a rise.

4.8. If, on the other hand, a large fraction of the women covered in the
foregoing analysis were too advanced in age (say, 60 years or above), then even in the later
parities, one should expect a number of births that took place so long ago as to be affected
by recall lapse. It may, therefore, be argued that such comparisons can
hardly detect the existence of recall lapse unless some of the rates compared
relate to very recent events. As such, the births occurring to ever-married women aged
43 years and over were classified into two chronological groups, namely, those occurring
within and those occurring earlier to the period of 15 years preceding the date of survey,
and the infant mortality rates for these two groups were mutually compared. The results
are given in Table 4.2.


TABLE 4.2. MORTALITY RATES FOR INFANTS BORN TO EVER
MARRIED WOMEN AGED 43 YEARS OR OVER
ACCORDING TO CHRONOLOGICAL GROUPS


                                                     infant mortality rate
period when births occurred                          rural            urban
            (1)                                       (2)              (3)
within 15 years preceding the date of survey      69.26 (231)¹     119.05 (84)
15 or more years earlier to survey               160.28 (1984)     137.17 (678)


¹ Figures in parentheses refer to the numbers of live births on which
the infant mortality rates are based.


4.9. Since the earlier parities are associated with higher infant mortality rates
than the later ones, the chronological comparison should have been attempted at corresponding
parities. As this was impracticable with the data in hand, all parities were mixed
together into one lot within each chronological group. As the recent chronological group
is likely to be more heavily loaded with later parities one should naturally expect a lower
infant mortality among them. Among rural births this is clearly indicated by the rates
entered in Table 4.2. If recall lapse was operative, such striking difference in the infant
mortality rates would not have been observed. As most of the women included in the urban
sample were of advanced age, hardly 84 births occurred to them during the last 15 years
and the remaining 678 births were classified in the older chronological group which naturally
included a substantial number of later parities as well. The contrast between the infant
mortality rates for the two chronological groups is, therefore, relatively less marked than
that observed for the rural samples. In any case, it appears from the above analysis that
the effect of recall lapse is not statistically very significant.


4.10. In this study, the estimates of infant mortality rates have been based on
all live-births that occurred to ever-married women of the respective population groups
upto the date of survey. The infant mortality rates have been estimated for the rural
and urban sectors separately and the results for each of the two sub-samples are shown in
Table 01.1 of Appendix 1.


4.11. In what follows, an attempt is made to study the association of infant mortality
rate with such factors as nutritional level of the household, housing condition, educational
and occupational status of fathers.


4.12. Nutrition. Nutrition being a vital factor of health, a separate block
(block 4) was devoted in the schedule for entering the various items regarding the dietary
composition of the households. The scope of the survey did not, however, permit a detailed
analysis of the various items comprising the diet so as to assess the nutritional level of the
household based on its diet. The majority of the Indian households do not necessarily avail
even the basic energy-giving food articles like cereals, etc., for satisfying their hunger.
This being the case, differentiation of diet can be effected even on the basis of the quantity
of cereals consumed. If appropriate adult equivalents for various age and occupational groups
were available, one could have classified the households by varying degrees of nutritional
level on the basis of the quantity of cereals consumed. In this study, however, it has been
assumed that a household which avails even a small quantity of milk, meat, fish, fruits, etc.,
for whatever it is worth, must be doing so only after satisfying its basic needs in respect of
cereals. On this assumption, diets which are almost completely devoid of milk, fish, meat,
etc., have been placed in the category ‘low level of nutrition’ and the remaining diets in the
category ‘high level of nutrition’. The infant mortality rates observed in the above two
dietary classes are shown separately for rural and urban sectors in Table 4.3.


TABLE 4.3. INFANT MORTALITY RATE ACCORDING TO LEVEL OF
NUTRITION IN WEST BENGAL


                               rural                      urban
nutritional level      no. of live   infant       no. of live   infant
                         births      mortality      births      mortality
                                     rate                       rate
       (1)                (2)          (3)           (4)          (5)
1. high                   155        116.13          659        110.77
2. low                   5384        170.69         1186        150.08
3. total                 5539        169.16         1845        136.04


4.13. In both the rural and urban sectors, the differences in the recorded infant
mortality rates are significant, the difference being more pronounced in the case of the
rural group. Inasmuch as the rate corresponding to the nutritional group classed as ‘high’
in the rural sector is based on an inadequate number of live births, the observed difference
is to be accepted with a little caution.
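

The text does not say which test underlies the statement on significance. One conventional
check is a two-proportion z-test on the figures of Table 4.3, sketched below. The death
counts are back-calculated from the printed rates and live births, which is an assumption,
and all function and variable names are illustrative.

```python
import math


def deaths_from_rate(rate_per_1000: float, live_births: int) -> int:
    """Recover a whole-number death count from a printed rate (assumed rounding)."""
    return round(rate_per_1000 * live_births / 1000)


def two_proportion_z(deaths_a: int, births_a: int, deaths_b: int, births_b: int) -> float:
    """z statistic for the difference between two death proportions (b minus a)."""
    p_a = deaths_a / births_a
    p_b = deaths_b / births_b
    pooled = (deaths_a + deaths_b) / (births_a + births_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / births_a + 1 / births_b))
    return (p_b - p_a) / std_error


# High versus low nutritional level, from Table 4.3.
z_rural = two_proportion_z(deaths_from_rate(116.13, 155), 155,
                           deaths_from_rate(170.69, 5384), 5384)
z_urban = two_proportion_z(deaths_from_rate(110.77, 659), 659,
                           deaths_from_rate(150.08, 1186), 1186)
```

Because the rural ‘high’ group rests on only 155 live births, its standard error is nearly
twice the urban one, which is in line with the caution expressed above.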


4.14. Housing. Housing is an important factor affecting the health of a population.
One of the objectives of this survey was to study the association between the sanitary
assessment of the household and certain important health indices like infant mortality rate
and thereby to evolve a suitable methodology for collecting information on housing and
environmental conditions. Hence, a few questions relating to the housing and sanitary
conditions of the household and its environment were included in block 5 of the schedule.
Only such aspects which were more easily definable and less subjective in nature have been
introduced in this block and as such it had to be necessarily [...]. Nevertheless, the
[...] approach adopted, particularly with reference to the rural sample.


4.15. In the rural area, most of the households were without latrines and as such
no variation in the type of latrine used was ascertainable from the returns. The information
obtained on ventilation of households though somewhat subjective in nature, proved to be
defective for the rural and urban areas on account of the vagaries of the investigator. As
regards the general sanitation the investigators relied more on the relative differences among
the households allotted rather than on the objective classifications specified for the
investigational procedure. In the rural sector particularly, where the village as a whole was
allocated to each investigator, the entries relating to this aspect for all the households were
almost identical, the nature of such entries being determined largely by his impressions.
In towns and cities, however, the households allotted to each investigator being situated
over an area within which conspicuous differences in sanitary conditions existed, there was
a greater degree of variation in the general sanitation codes entered by him. With regard to
source of drinking water, taps, tubewells and ordinary wells were almost universal in the
urban areas, whereas in the rural areas the majority of the people depended on wells or
ponds. Hence, with the type of information available, it was thought that no useful
classification of rural households could be possible on the basis of such characteristics as
sanitary conditions of the dwelling place. For the classification of the urban households,
information on latrine and general sanitation of surroundings alone could provide a valid
basis. The households with code 1 for both ‘latrine’ and ‘general sanitation’ were classified
as ‘housing good’ and the rest as ‘housing bad’. The infant mortality rates in these two
classes of the urban population are shown in Table 4.4 which clearly indicates that the
population which avails better sanitary amenities in and around its dwelling place
is associated with a lower infant mortality rate.

TABLE 4.4. INFANT MORTALITY RATE ACCORDING TO HOUSING
CONDITIONS IN WEST BENGAL (URBAN)


housing condition      no. of live births      infant mortality rate
       (1)                    (2)                      (3)
1. good                        366                     84.70
2. bad                        1479                    148.75
3. total                      1845                    136.04


4.16. No doubt, the two criteria considered above, namely, nutrition
and sanitary condition are indices to study the state of health of a community. Data
have also been collected in this survey on the literacy and occupational status of the
population. Though these may not directly influence the state of health, it has been noticed
in earlier studies that such socio-economic factors as education and occupation bear a
strong association with the infant mortality rate. Hence, these socio-economic factors
can be used with great advantage in health studies for more effective stratification of
households for sample selection.


4.17. Education. In so far as education improves the quality of maternal care and
personal hygiene by raising the level of health consciousness of the community its association
with infant mortality rate may be regarded as a direct one. Besides the above, its indirect
relationship through such other health factors as nutrition, sanitary condition, etc., enhances
the degree of such association. For a proper evaluation of the importance of education it would
have been desirable to consider the education of mothers but under the prevailing circumstances
this is not feasible as the large majority of women, especially in the rural sector, are
illiterate. Hence, it was decided to take into account the education of fathers for the above
analysis. Here again, due to the limitations of the data only two broad categories were
considered, namely (i) births relating to couples in which the male partners were matriculates
or had higher qualifications and (ii) the remaining births relating to couples in which the male
partners were either illiterate or under-matrics. The proportion of births belonging to
the higher educational group (matric or above) to total births was roughly 22% in the urban
and 2% in the rural sector. The estimated infant mortality rates for the two literacy classes
are shown separately for rural and urban samples in Table 4.5.


TABLE 4.5. INFANT MORTALITY RATES ACCORDING TO LITERACY
STATUS OF FATHERS IN WEST BENGAL


                                       rural                  urban
literacy status                  live      infant       live      infant
                                 births    mortality    births    mortality
                                           rate                   rate
       (1)                        (2)       (3)          (4)       (5)
1. matric and above               108      55.56         404      76.73
2. below matric including
   illiterates                   5431     171.42        1441     152.67
3. total                         5539     169.16        1845     136.04


4.18. The results are indeed very striking, the higher literacy group showing an
infant mortality rate which is nearly one-third and one-half of that of the lower literacy
group in the rural and urban areas respectively. This is a clear indication that literacy
status is an excellent criterion for stratification of urban households in sample surveys of
this nature. In view of the fact that only about 2% of the births in the rural areas correspond
to the higher literacy group, a similar stratification is of doubtful utility in studies of this
kind for rural populations.
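

The ratios quoted above follow directly from Table 4.5. A minimal sketch of the arithmetic is
given below; the dictionary layout and function names are illustrative assumptions, while the
figures are those printed in the table.

```python
# sector -> literacy class -> (number of live births, infant mortality rate per 1000)
TABLE_4_5 = {
    "rural": {"matric_and_above": (108, 55.56), "below_matric": (5431, 171.42)},
    "urban": {"matric_and_above": (404, 76.73), "below_matric": (1441, 152.67)},
}


def rate_ratio(sector: str) -> float:
    """Rate of the higher literacy group as a fraction of the lower group's rate."""
    higher = TABLE_4_5[sector]["matric_and_above"][1]
    lower = TABLE_4_5[sector]["below_matric"][1]
    return higher / lower


def share_of_births(sector: str) -> float:
    """Percentage of all live births that fall in the higher literacy group."""
    higher = TABLE_4_5[sector]["matric_and_above"][0]
    lower = TABLE_4_5[sector]["below_matric"][0]
    return 100 * higher / (higher + lower)


rural_ratio = round(rate_ratio("rural"), 2)   # about one-third
urban_ratio = round(rate_ratio("urban"), 2)   # about one-half
```

The same table gives the share of births in the higher literacy group, which is what makes
the stratification attractive in the urban sector and of doubtful use in the rural one.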


4.19. Occupation. In order to study the behaviour of infant mortality rates in
different occupational groups a similar analysis as above was carried out. Here again, the
occupational classification had to be confined to those of the male partners only. Due to
paucity of data the analysis had to be limited to four broad categories of occupational status
for the urban and rural sectors as indicated below.


4.20. Urban: 1. Manual labour (mostly unskilled industrial labour, domestic
                servant, porter, hawker, rickshaw puller, artisan, etc.)
             2. Lower professions and inferior business (clerk, school teacher,
                retail trader, shop assistant, skilled industrial labour, etc.)
             3. Higher professions and superior business (doctor, professor,
                engineer, lawyer, wholesale trader, etc.)
             4. Non-gainful occupations (rent receiver, remittance receiver,
                beggar, etc.)


4.21. Rural: 1. Agricultural and other rural labour (landless agricultural labour,
                artisan, fisherman, cooly, etc.)
             2. Agricultural operations (cultivator owning land, share-cropper,
                etc.)
             3. Professions and trade (teacher, doctor, priest, retail trader, etc.)
             4. Non-gainful occupations


4.22. In both the urban and rural populations, the class ‘non-gainful occupations’
contains a highly heterogeneous social group as it includes all persons returned as ‘not in
the labour force’, irrespective of their living standards.


4.23. The estimated infant mortality rates for the different occupational classes 
are shown in Tables 4.6 and 4.7 for the urban and rural populations respectively. 


TABLE 4.6. INFANT MORTALITY RATE ACCORDING TO OCCUPATION OF FATHERS
IN WEST BENGAL (URBAN)


occupation class                                  no. of         infant mortality
                                                  live births    rate
       (1)                                           (2)            (3)
1. manual labour                                     492          170.73
2. lower professions and inferior business          1061          126.30
3. higher professions and superior business          155           45.16
4. non-gainful occupations                           137          189.78
5. total                                            1845          136.04


TABLE 4.7. INFANT MORTALITY RATE ACCORDING TO OCCUPATION OF FATHERS
IN WEST BENGAL (RURAL)


occupation class                                  no. of         infant mortality
                                                  live births    rate
       (1)                                           (2)            (3)
1. agricultural and other rural labour              1486          154.10
2. agricultural operations                          3120          183.33
3. professions and trade                             429          158.51
4. non-gainful occupations                           504          134.92
5. total                                            5539          169.16


4.24. In the lowest occupation class (manual labour) of the urban population the
recorded infant mortality rate is as high as 170.73 per 1000 live-births, that for the highest
social class (higher professions and superior business) is as low as 45.16 per 1000 live-births
and that for the intermediate class (lower professions and inferior business) is
126.30 per 1000 live-births. From these results, it is quite apparent that the infant mortality
rates tend to decline as one goes up the social ladder. In rural areas, however, due to
the inadequate number of births belonging to the higher professional group, these were
merged with the lower professions to form class 3 (professions and trade). Due to the
preponderance of the births belonging to the lower professions, the infant mortality rate
observed for this combined group was considerably enhanced. Social class 2 (agricultural
operations), which includes every cultivator owning land however small his holdings may be,
and every share-cropper, howsoever small the area operated by him may be, is certainly a
heterogeneous group. For these reasons, the results in Table 4.7 do not suggest any
occupational differentials in the infant mortality rate. This, perhaps, indicates that for
stratifying the rural population into social classes, it may be desirable to take cognizance
of certain other relevant factors. Social differentiations based on either income or land
operated or owned may be more appropriate for health studies as these will more closely
correspond to the actual living standards of the households.


4.25. In this context, it may be appropriate to consider the Registrar General’s
figures of infant mortality rate by social class of father in England and Wales in 1939
(Table 4.8) and see how the social position of the community affects the infant mortality
rate.


TABLE 4.8. INFANT MORTALITY RATE ACCORDING TO SOCIAL CLASS OF FATHER
IN ENGLAND AND WALES IN 1939


social class                                                        infant mortality rate
     (1)                                                                    (2)
1. class I: the professions, commissioned officers and well-to-do
   people concerned with finance, shipping etc.                            26.8
2. class II: intermediate between class I and skilled workers              34.4
3. class III: skilled workers                                              44.4
4. class IV: intermediate between skilled and unskilled workers            51.4
5. class V: unskilled workers                                              60.1


4.26. The above figures exhibit a high degree of consistency and regularity in the 
changing pattern of infant mortality rate with changing social class. A similar feature 
is observed in the occupational classes in urban West Bengal. This clearly suggests that 
occupational or social status offers an excellent criterion for a more effective stratification 
of the urban population for sample selection in studies of this kind. 


4.27. In conclusion, it may be stated that the higher level of nutrition and the
higher level of literacy and occupational status are associated with lower infant mortality
rates. It is also interesting to note from the results given above that when comparable
groups are matched against each other the rural groups generally show higher infant
mortality rates than their urban counterparts. However, the infant mortality differentials
estimated from registration data indicate an entirely opposite picture. Even in respect of
the overall estimate of infant mortality rate the value observed in this study for the rural
sector appears to be higher than that observed for the urban sector. If this is true, it
is obviously at variance with the accepted notion on urban-rural differentials based on
registration data. Since official figures do not relate to allocated rates, further examination
of the registration data is necessary before arriving at any definite conclusions.

4.28. It is a known fact that infants die at a faster rate during the earlier periods
of their life. It can be reasonably assumed that in rural West Bengal about 40 per cent
of the infants die before they complete one week. Possibly, the infant mortality rate
observed in case of rural births may be largely attributable to deaths occurring during this
stage of life due to inadequate and unsatisfactory maternity aid available to rural mothers.


CHAPTER 5


MORBIDITY 


5.1. At present the statistical data collected on health aspects are so meagre and
unreliable in our country that they can at the utmost provide a very crude and hazy outline of
the health conditions of the nation. But these are not enough for sound public health
administration which obviously has to be based on reliable and adequate factual data. In
the past, a few special surveys relating to malaria, tuberculosis, leprosy, etc., have been
attempted in selected areas only. Such surveys are useful, no doubt, in shaping the public
health policy to some extent in these areas, but in order to plan for the improvement of
health standards of the community as a whole, an appraisal of the disease and medical care
pattern is essential and this can be done only by a general health survey.
5.2. The first general health survey to be conducted in this country was the Singur
Health Survey (Lal and Seal, 1949). But this study, though comprehensive in nature, had
to be necessarily confined to a small rural area in West Bengal comprising the 4 union
boards falling within the sphere of operation of the Singur Health Centre. It was expected
that when the report of the above study was published, similar enquiries would be made
in other parts of India to obtain a general picture of the morbidity and medical care pattern.
But till the year 1955 no attempts were known to have been made in this direction.

5.3. The lack of initiative in sponsoring surveys of this kind certainly reflects the
peculiar difficulties inherent in such surveys, especially in countries like India where more
experiments in survey technique and procedure remain to be done before pushing through
full-fledged health surveys on a vast scale. The West Bengal Health Survey, as has already
been pointed out, is only such an experiment which was undertaken mainly for the purpose
of developing a methodology for the collection of health and medical care statistics.


5.4. Certain experienced public health workers with whom the scheme of this survey
was discussed, were rather critical of the approach that had been adopted in this survey.
They raised pertinent questions regarding the validity of the morbidity statistics collected
by such surveys relying mainly on non-medical investigators. The most serious objection
centred round the correctness of the classification of diseases by causes. It may be mentioned
that even in countries where medical care has become almost universal, the competency
of the informants, generally the heads of households, [...] which has become a thing of the
past, is somewhat [...] placed on prevalence rates for certain chronic diseases obtained by
complete [...] examination of the selected individuals including laboratory tests by trained
medical personnel. Even if such a scheme were [...] on a nation-wide scale for a country like
India having only meagre resources, it could [...] the important chronic diseases and the
question regarding the incidence of acute diseases which form a substantial bulk of the
total morbidity of the country [...].

5.5. [...] of the chief objectives of this study [...] the extent [...] the reported causes
of [...] the diagnostic reports [...] of treatment wherever such reports were [...] available,
it was found that only 4 cases of illness were accompanied by such reports in spite
of the fact that about two-thirds of the cases were medically treated. Presumably, the odds
against collecting such information were so great that the investigators could not possibly 
succeed in carrying out the instructions. Of the 4 cases for which medical confirmation 
was available, only 2 cases—one of ‘appendicitis’ and the other of ‘pneumonia’—were found 
to be in complete agreement with the investigators’ returns. For the remaining two cases 
which were medically declared as tuberculosis (pulmonary), the informants returned them 
as merely ‘cough and fever’. 


5.6. Though in a substantial number of cases, certain remarks regarding the nature
of the diseases made by the informants of their own accord and entered in the ‘remarks’
column of block 7, were very helpful to the medical experts in arriving at a proper classification
of the diseases, it has to be admitted that due to non-availability of confirmatory
evidence from the attending physicians, no check on the validity of the returns could be made.


5.7. As the question of validity is an important one on which depends to a considerable
degree the success of a morbidity study, a ‘Validity Survey’ was initiated in 1956.
For the purpose of this survey, two teams of investigators, one medical and the other
non-medical, were employed. The medical investigators were medical graduates with considerable
professional experience. The non-medical investigators, on the other hand, did not
have any particular training or knowledge in public health. Names of patients with addresses
were collected from the medical out-patient department of the R.G. Kar Medical College
Hospitals, which is one of the leading hospitals in Calcutta. These names were supplied
to a set of pilot investigators for verifying the addresses as well as to note down the names
of all members of the households. After this preliminary listing was done, the medical and
non-medical investigators were given the names and addresses of the heads of the households.
They were also given the names of all the members of the households to ensure that the
particular person who had been to the hospital and about whose illness information was
available, was not omitted from investigation. They were required to collect details about
all illnesses occurring to the members of the households within a reference period of one
month. The non-medical investigators were instructed to put down the cause of the disease
as stated by the informants and supplement them with details of signs, symptoms etc. of
the disease, if such information was forthcoming from them on their own initiative. The
medical investigators, on the other hand, had a greater degree of freedom in that they
could interrogate the heads of the households or the patients themselves for their views
on the disease. This freedom was not allowed to the non-medical investigators as
it was thought that the non-medical investigator by virtue of his not having any medical
knowledge or training was not competent to suggest leading questions to arrive at a
proper conclusion as regards the exact nature of the disease. No indication whatsoever
was given to the investigators as to the source of these addresses. When these households
had been contacted and necessary information gathered in the schedules specially
designed for the purpose, the returns were compared with the hospital diagnoses. It
would seem that the best way of doing this would have been to send both the types of
investigators to the same households and compare their results. But this procedure did
not appeal to us as in such a short time which was generally less than a month, it was not
desirable to subject a household to a series of questions by different investigators especially
when a disease was prevailing in that household. Moreover, there might be a tendency for
the first investigation to influence the result of the second.
5.8. In order to make an overall assessment of the relative merits of the two types
of investigating teams, the cases diagnosed in each class of diseases were equally apportioned
between the two teams. It could be seen that by the above arrangement if the extent of
misclassification varied with the nature of disease, the odds were equally balanced between
the two teams. A total of 396 cases could be contacted in their households and of these 198
were investigated by the medical investigators and the remaining 198 by the non-medical
investigators. The investigators’ returns were then compared with the corresponding
reports obtained from the hospital register and the results of this comparison are shown
in Tables 5.1 and 5.2 for medical and non-medical investigators. It may be mentioned
here that the non-medical investigators’ reporting of diseases being sometimes vague,
they were allocated to proper disease groups by medical experts on the basis of
signs, symptoms and other available particulars of such diseases.


TABLE 5.1. THE EXTENT OF AGREEMENT BETWEEN THE HOSPITAL DIAGNOSES
AND RETURNS OF THE MEDICAL INVESTIGATORS


disease                                                 complete    no          doubtful    not        total
                                                        agreement   agreement               recorded
(1)                                                     (2)         (3)         (4)         (5)        (6)
 1. group I—tuberculosis (pulmonary)                    —           6           —           —          6
 2. group II—malaria                                    3           10          5           2          20
 3. group III—dysentery                                 4           9           2           —          15
 4. group IV—other infectious and parasitic diseases    [illegible]
 5. group V—allergic, endocrine system, metabolic
    and nutritional diseases                            [illegible]
 6. group VI—diseases of blood and blood-forming
    organs                                              [illegible]
 7. group VII—mental, psychoneurotic and
    personality disorders and diseases of the
    nervous system and sense organs                     [illegible]
 8. group VIII—diseases of the circulatory system       [illegible]
 9. group IX—influenza                                  7           5           3           2          17
10. group X—bronchitis                                  [illegible]
11. group XI—other respiratory diseases                 [illegible]
12. group XII—diseases of digestive system              [illegible]
13. group XIII—diseases of genito-urinary system        [illegible]
14. group XIV—diseases of bones and organs of
    movement                                            [illegible]
15. group XV—other diseases                             [illegible]
16. total                                               67          90          30          11         198
TABLE 5.2. THE EXTENT OF AGREEMENT BETWEEN THE HOSPITAL DIAGNOSES
AND RETURNS OF THE NON-MEDICAL INVESTIGATORS


disease                                                 complete    no          doubtful    not        total
                                                        agreement   agreement               recorded
(1)                                                     (2)         (3)         (4)         (5)        (6)
 1. group I—tuberculosis (pulmonary)                    3           3           —           —          6
 2. group II—malaria                                    6           10          3           2          21
 3. group III—dysentery                                 1           12          —           2          15
 4. group IV—other infectious and parasitic diseases    —           10          1           —          11
 5. group V—allergic, endocrine system, metabolic
    and nutritional diseases                            2           9           —           1          12
 6. group VI—diseases of blood and blood-forming
    organs                                              1           7           —           —          8
 7. group VII—mental, psychoneurotic and
    personality disorders and diseases of
    nervous system and sense organs                     2           9           —           —          11
 8. group VIII—diseases of the circulatory system       2           2           —           1          5
 9. group IX—influenza                                  5           6           1           5          17
10. group X—bronchitis                                  6           17          3           4          30
11. group XI—other respiratory diseases                 3           11          —           1          15
12. group XII—diseases of digestive system              26          8           —           5          39
13. group XIII—diseases of genito-urinary system        —           —           —           —          —
14. group XIV—diseases of bones and organs of
    movement                                            5           2           —           1          8
15. group XV—other diseases                             —           —           —           —          —
16. total                                               62          106         8           22         198


5.9. Two types of discrepancies are possible in this situation. The first is that the
diagnosis entered in the hospital register does not tally with those obtained from the
investigators’ reports and secondly, certain individuals who were known to have attended the
hospital in connection with certain definite illness were not reported by the investigators
probably due to recall lapse. Occasionally, a person who attended the hospital in connection
with a certain illness might have been ill due to another illness also during the reference
period. In such a situation if the investigators’ reports did not tally with the hospital
reports, it would not be possible to state clearly whether it was a misclassification or an
omission of the disease for which the enquiry was made. If, however, the date of onset of
any disease reported by the investigators preceded the date of hospital attendance it was
assumed that the report related to the disease for which hospital aid was sought. On the
other hand, if the date of onset reported was later than the date of hospital attendance,
it was difficult to decide whether the investigators’ report related to the same disease as was
treated in the hospital or to a different one. In the latter situation the discrepancies were
assigned in a separate column ‘doubtful’ in Tables 5.1 and 5.2. Further, there were
certain cases where there was complete agreement between the investigators’ reports and
the corresponding entries in the hospital register, but the date of onset reported by the
investigators was subsequent to the date of hospital attendance. But as these cases were
either of a chronic or intermittent nature, it could be reasonably assumed that the
investigators’ reports related to the same diseases as were treated in the hospital. It may be
seen from Tables 5.1 and 5.2 that out of 198 cases of illness 30 and 8 cases allotted to the
medical and non-medical investigators respectively did not tally with the hospital entries
for lack of knowledge, whether the investigators’ reports related to the corresponding
diseases for which the enquiry was made. Out of 168 cases for which medical investigators’
reports could be tallied with the corresponding hospital entries, 11 were missed
and 90 were misclassified, whereas for the non-medical investigators, out of 190 cases
for which the reports could be tallied with the hospital entries 22 were missed and
106 were misclassified. The percentage of cases missed by the medical investigators
is only 6.5 per cent as compared with 11.6 per cent missed by the non-medical investigators.
This indicates that there will be more response obtained from the informants if medical
investigators are employed. Among 157 cases reported by the medical investigators which
could be tallied with hospital entries, 90 cases or about 57 per cent were misclassified,
whereas among 168 similar cases reported by the non-medical investigators, 106 cases or
about 63 per cent were misclassified. The above results suggest that both in respect of
extent of response from the informants as well as in the matter of correct classification, the
performance of the medical investigators seems to be slightly more satisfactory than that
of the non-medical team. The extent of misclassification in the 15 groups of diseases
individually is shown in the table below.

TABLE 5.3. PERCENTAGE MISCLASSIFICATION AMONG THE MEDICAL AND
NON-MEDICAL INVESTIGATORS IN DIFFERENT DISEASE CATEGORIES


disease                                                      medical         non-medical
                                                             investigator    investigator
(1)                                                          (2)             (3)
 1. group I—tuberculosis (pulmonary)                         100.0           50.0
 2. group II—malaria                                         76.9            62.5
 3. group III—dysentery                                      69.2            92.3
 4. group IV—other infective and parasitic diseases          45.5            100.0
 5. group V—allergic, endocrine system, metabolic and
    nutritional diseases                                     22.2            81.8
 6. group VI—diseases of blood and blood-forming organs      87.5            87.5
 7. group VII—mental, psychoneurotic and personality
    disorders and diseases of nervous system and
    sense organs                                             75.0            81.8
 8. group VIII—diseases of circulatory system                50.0            50.0
 9. group IX—influenza                                       41.7            54.5
10. group X—bronchitis                                       47.6            73.9
11. group XI—other respiratory diseases                      88.9            78.6
12. group XII—diseases of digestive system                   45.5            23.5
13. group XIII—diseases of genito-urinary system             100.0           —
14. group XIV—diseases of bones and organs of movement       42.9            28.6
15. group XV—other diseases                                  0.0             —
16. total                                                    57.3            63.1


Vol. 21]  SANKHYA: THE INDIAN JOURNAL OF STATISTICS  [Parts 1 & 2


5.10. That the extent of disagreement in the reporting of the diseases varies with
the type of diseases investigated is quite evident from Table 5.3. Moreover, the above
table singles out such types of diseases which are likely to be more often misreported by
the medical and non-medical investigating teams as well as such diseases for which the
extent of agreement with the corresponding hospital diagnosis does not show conspicuous
difference between the two investigating teams. For instance, diseases belonging to groups 6
(diseases of blood and blood-forming organs), 7 (mental, psychoneurotic and personality
disorders and diseases of the nervous system and sense organs) and 8 (diseases of the
circulatory system) show almost equal tendency to be misclassified by the medical and the
non-medical investigators. On the other hand, diseases belonging to groups 1 (pulmonary
tuberculosis), 2 (malaria), 11 (other respiratory diseases), 12 (diseases of the digestive system)
and 14 (diseases of bones and organs of movement) are generally misreported to a greater
extent by the medical investigators and diseases belonging to groups 3 (dysentery), 4 (other
infective and parasitic diseases), 5 (allergic, endocrine system, metabolic and nutritional
diseases), 9 (influenza) and 10 (bronchitis) are similarly misclassified by the non-medical
investigators. These results are important in as much as they are suggestive of the quality
of reporting by the medical and non-medical investigating teams with respect to different
disease groups. In order to properly assess the validity of the rates obtained by the two
types of investigators and to investigate the nature of misclassification, further examination
of the data is essential. Tables 5.4 and 5.5 give the two-way comparison of the reports
of the investigators with the corresponding entries in the hospital register.


5.11. In the above tables the diagonal entries represent those cases where there
is complete agreement between investigators’ returns and the hospital diagnoses. The
figures in rows indicate the extent of misclassification occurring for each type of disease taken
from the hospital register. If we assume for the purpose of argument that the reports
obtained from the hospital register are true as they were made during the time of treatment,
the entries in each row divergent from the diagonal cell will show the extent of
under-estimation of the morbidity rate for this particular group of diseases due to
misclassification. On the other hand, the entries in any column diverging from the diagonal
cell will indicate the extent of over-estimation of the morbidity rate due to inclusion of diseases
of other categories in this group by the investigators. Of course, this assumption regarding
the hospital diagnoses need not be true and, therefore, this study of the divergence between
the two types of classification can be interpreted only as a lack of agreement and nothing
further.

deserve consideration. In the followi 


5.13. There is a general tendency noticeable for pulmonary tuberculosis to be
invariably misclassified as either asthma or bronchitis. Other forms of tuberculosis are also
usually misreported. It is also found that the non-medical investigators have a
tendency to report non-tuberculosis cases as pulmonary tuberculosis cases resulting in an
exaggeration of the pulmonary tuberculosis rate estimated from their returns.

5.14. Malaria is sometimes reported as influenza or other respiratory diseases or
diseases of the digestive system and non-malaria cases are not generally returned as malaria
by both the medical and non-medical investigators. Hence it seems that the malaria
rate as obtained by a household canvass is likely to be an under-estimate.

5.15. Dysentery cases are usually misreported as some disease pertaining to the
digestive system (group 12). The non-medical investigators seldom report non-dysentery
cases as dysentery cases. Consequently, the incidence rate for dysentery obtained by
non-medical investigators will tend to be biased towards a lower value than the true one. On
the other hand, in the case of medical investigators the rate obtained may be regarded as
almost nearly the true value mainly due to some of the non-dysentery cases being returned
as dysentery cases.
5.16. The group ‘other infective and parasitic diseases’ is evidently a heterogeneous
one. This includes all infectious and parasitic diseases other than pulmonary tuberculosis,
malaria and dysentery. Naturally, one would expect much less discrepancy in this particular
group. Curiously enough, the non-medical reports are totally discrepant from the
corresponding entries in the hospital register. However, a number of diseases belonging to
other groups have been reported by the non-medical investigators as diseases belonging to
this group. The performance of the medical investigating team in respect of reporting diseases
of this category seems to be somewhat satisfactory. About 55% of the cases are reported
correctly and only one case belonging to another group has been brought into this category.
It is likely, therefore, that the estimate obtained from the medical team will be erring on the
lower side. The misclassification in this group usually arises due to neuritis cases being
classified as rheumatism, enteric fever as bronchitis or other respiratory diseases. Probably,
hospital staff have a tendency to enter any disease of unknown etiology as neuritis which
on further examination is reported by the medical investigators as rheumatism or some
other specific disease.

5.17. The performance of the medical team in respect of reporting diseases belonging
to group 5 (allergic, endocrine, metabolic and nutritional diseases) seems to be fairly
satisfactory. But the classification based on the returns of the non-medical investigators seems
to be far from satisfactory, only about 16 per cent of the total number of cases allotted
to them having been correctly classified. Nevertheless, the rate based on their reports is
very nearly equal to the one based on hospital records for the reason that a number of cases
belonging to other groups have been brought into this category due to their peculiar nature
of reporting.
5.18. Diseases of blood and blood-forming organs like anaemia are most often
misreported as diseases of the digestive or genito-urinary system by the medical and non-medical
investigators. There is less tendency on the part of the non-medical investigators to classify
diseases belonging to other groups as diseases of this group with the result that the overall
estimate of diseases of blood and blood-forming organs will still remain grossly
under-estimated.

5.19. Diseases belonging to group 7 (mental, psychoneurotic, nervous system etc.)
tend to be misclassified as diseases of the digestive or genito-urinary system or rheumatism
though the rates are kept up by the inclusion of diseases of other categories in this group.

5.20. [...] are concerned both the types of investigators show almost equal extent of
disagreement with the hospital entries.

5.21. When diseases of the respiratory system (groups 9, 10 and 11) are considered
individually the classification based on the reports of the medical investigators seems to be
only moderately good. A closer examination will reveal that the misclassifications are mostly
confined within the three groups themselves so that if these three groups are combined into
a single group representing all respiratory diseases, the reporting of the medical investigators
tends to be more satisfactory. But in the case of non-medical investigators, the
misclassifications appreciably extend beyond the three groups and the resulting rate obtained from
their reports is likely to be an under-estimate as the losses to these groups are usually higher
than the gains.


5.22. The diseases of the digestive system are very often misreported by the medical 
investigators as dysentery, or diseases of the genito-urinary system. However, this group 
gains at the expense of diseases like dysentery, anaemia and bronchitis. Hence, the overall 
rate obtained for this group seems to be very close to that obtained from the hospital register. 
There is a greater degree of agreement observed in the returns of the non-medical team. But 
the rate obtained from these reports appears to be grossly exaggerated due to the inclusion 
of a large number of cases of dysentery and respiratory diseases and to a lesser extent cases 
belonging to other disease groups in this category. 

5.23. The number of cases of diseases of the genito-urinary system obtained from
hospital records being very few, the nature of misclassification cannot be assessed. However,
a number of other diseases have been reported as diseases of this category both by the
medical and non-medical teams, indicating that the rates obtained on the basis of these reports
are likely to be grossly exaggerated.

5.24. In respect of diseases of bones and organs of movement, the extent of
disagreement between the hospital diagnoses and the investigators’ reports seems to be moderate
for both the sets of investigators. But the rates obtained for
this category of diseases are likely to be exaggerated appreciably due to the inclusion in
this group of cases belonging to other groups.

5.25. In the preceding paragraphs [...]
resulting from disagreement between the investigators’ reports and the corresponding hospital
entries was slightly less in the case of the medical investigators. One may argue that the
advantage of using a medical investigator is considerably
lowered due to the inclusion of non-prevailing cases. Probably, if the cases were prevailing
a more thorough examination of the cases could have been made by the medical investigators
which would result in an appreciable improvement in the quality of their reports. If this
were so, we may reasonably assume, that for the prevailing cases there should be an
[...] reports and the corresponding [...] diagnoses. In this respect only the cases of
diseases prevailing at the time of investigation have been considered and the evaluation of
the agreement for the medical and non-medical teams has been made in Table 5.6.

5.26. The results show that an overall disagreement of 55.9 per cent was
recorded for the medical investigators as against 57.3 per cent when both the prevailing
and non-prevailing cases were considered. When an assessment [...]
is done disease-wise for the medical investigators, we find [...]
the same as when the non-prevailing cases were also included for all the disease groups except
for diseases of the circulatory system, influenza, bronchitis [...]
system. In the case of influenza, it has turned out to be [...] the case of diseases
of the circulatory and digestive systems and bronchitis [...] out to be better.
TABLE 5.6. PERCENTAGE MISCLASSIFICATION AMONG THE MEDICAL AND
NON-MEDICAL INVESTIGATORS IN DIFFERENT DISEASE CATEGORIES
WHEN ONLY PREVAILING CASES WERE CONSIDERED


disease                                                      medical         non-medical
                                                             investigator    investigator
(1)                                                          (2)             (3)
 1. group I—tuberculosis (pulmonary)                         100.0           50.0
 2. group II—malaria                                         80.0            100.0
 3. group III—dysentery                                      72.7            91.7
 4. group IV—other infective and parasitic diseases          37.5            100.0
 5. group V—allergic, endocrine system, metabolic and
    nutritional diseases                                     29.2            77.8
 6. group VI—diseases of blood and blood-forming organs      100.0           83.3
 7. group VII—mental, psychoneurotic and personality
    disorders and diseases of nervous system and
    sense organs                                             83.3            88.9
 8. group VIII—diseases of circulatory system                33.3            33.3
 9. group IX—influenza                                       100.0           100.0
10. group X—bronchitis                                       33.3            66.7
11. group XI—other respiratory diseases                      100.0           88.9
12. group XII—diseases of digestive system                   33.3            29.6
13. group XIII—diseases of genito-urinary system             —               —
14. group XIV—diseases of bones and organs of movement       33.3            0.0
15. group XV—other diseases                                  —               —
16. total                                                    55.9            64.0


5.27. The analysis in respect of the non-medical investigators revealed that the
divergence compared fairly well with that observed when both the prevailing and
non-prevailing cases were included in the analysis. However, for diseases like malaria and
influenza the degree of disagreement was higher while for diseases of the circulatory system
and of the bones and organs of movement the performance of the non-medical team was better
when only prevailing cases were considered.

5.28. While it is to be admitted that the performance of the medical investigators
seems to be slightly superior to that of the non-medical investigators, there is no indication
that the degree of precision in reporting diseases will be enhanced if only the prevailing cases
were investigated by the usual questionnaire method without taking recourse to other aids
such as physical examination and laboratory tests.

5.29. So far the analysis of the data was based on 15 diseases or groups of diseases.
A further condensation of groups though not desirable from the point of view of detail, is
expected to result in a closer agreement between the returns of the investigators and the
hospital diagnoses. In order to assess the improvement in reporting effected by grouping
the diseases on a still wider basis, the above 15 categories were reclassified into 9 broader
categories of diseases. This type of grouping no doubt introduces a greater degree of
heterogeneity within the groups. However, the results of the analysis based on such a broad
classification will indicate the level to which the extent of disagreement could be brought down.
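The effect of condensing the 15 categories into 9 broader ones can be illustrated with a small sketch. The mapping `BROAD` below is inferred from the group titles of Table 5.7 (in particular, placing genito-urinary diseases under "other diseases" is an assumption), and the `pairs` list consists of hypothetical cases, not survey records. The point shown is only that a misclassification falling within one broad group counts as agreement after regrouping.

```python
# Assumed mapping of the 15 detailed groups to the 9 broad groups of Table 5.7.
BROAD = {
    1: "I", 2: "I", 3: "I", 4: "I",      # infective and parasitic diseases
    5: "II", 6: "III", 7: "IV", 8: "V",
    9: "VI", 10: "VI", 11: "VI",         # diseases of respiratory system
    12: "VII", 14: "VIII",
    13: "IX", 15: "IX",                  # other diseases (assumption for group 13)
}

def pct_disagreement(pairs, mapping=None):
    """pairs: list of (hospital_group, reported_group) for tallied, reported cases."""
    if mapping is not None:
        pairs = [(mapping[h], mapping[r]) for h, r in pairs]
    wrong = sum(1 for h, r in pairs if h != r)
    return round(100.0 * wrong / len(pairs), 1)

# Hypothetical cases: several confusions stay within the respiratory groups 9-11.
pairs = [(9, 10), (10, 10), (11, 9), (2, 9), (3, 12), (12, 12), (1, 10), (10, 11)]

print(pct_disagreement(pairs))          # detailed 15-group classification
print(pct_disagreement(pairs, BROAD))   # broad 9-group classification
```

With these illustrative cases the disagreement falls from 75.0 to 37.5 per cent, because confusions among influenza, bronchitis and other respiratory diseases no longer count once the three are merged.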


5.30. Table 5.7 gives the percentage of disagreement observed in the returns of
the medical and non-medical investigators when they were compared with the corresponding
hospital diagnoses.


TABLE 5.7. PERCENTAGE MISCLASSIFICATION AMONG THE MEDICAL AND
NON-MEDICAL INVESTIGATORS IN DIFFERENT DISEASE GROUPS


disease group                                                medical         non-medical
                                                             investigator    investigator
(1)                                                          (2)             (3)
 1. group I—infective and parasitic diseases                 67.4            64.4
 2. group II—allergic, endocrine system, metabolic and
    nutritional diseases                                     22.2            81.8
 3. group III—diseases of blood and blood-forming organs     87.6            87.6
 4. group IV—mental, psychoneurotic and personality
    disorders, diseases of nervous system and sense
    organs                                                   75.0            81.8
 5. group V—diseases of circulatory system                   50.0            50.0
 6. group VI—diseases of respiratory system                  26.2            47.9
 7. group VII—diseases of digestive system                   45.5            23.6
 8. group VIII—diseases of bones and organs of movement      42.9            28.6
 9. group IX—other diseases                                  66.7            —
10. total


[...] has any bearing on the accuracy of reporting the cause [...] the members of his
household, the distribution [...] occupational groups according to the nature [...] presented
in Tables 5.8 and 5.9 respectively for both the medical and non-medical investigators.
As most of the patients attending the out-patients’ department of the hospital to which our
data relate belong to the lower
social groups the occupational stratification adopted in the analysis cannot be expected to
show significant class differentiation. For instance, it was necessary to include a few pro- 
fessionals in the same group consisting of clerical and other low income groups in order to 
obtain an appreciable number in the ‘high’ occupational class. The rest being mostly manual 
labourers living in bustees had to be allocated to the two classes ‘medium’ and ‘low’ according 


to the skill involved in the jobs. 


TABLE 5.8. DISTRIBUTION OF INFORMANTS ACCORDING TO EDUCATIONAL
STATUS AND NATURE OF DISEASE CLASSIFICATION


                                     medical                            non-medical
educational status         complete    no          total      complete    no          total
                           agreement   agreement              agreement   agreement
(1)                        (2)         (3)         (4)        (5)         (6)         (7)
1. illiterate              4           5           9          10          24          34
                           (44.4)      (55.6)      (100.0)    (29.4)      (70.6)      (100.0)
2. literate with no
   knowledge of English    22          30          52         27          49          76
                           (42.3)      (57.7)      (100.0)    (35.5)      (64.5)      (100.0)
3. literate with
   knowledge of English    41          55          96         25          33          58
                           (42.7)      (57.3)      (100.0)    (43.1)      (56.9)      (100.0)
4. total                   67          90          157        62          106         168
                           (42.7)      (57.3)      (100.0)    (36.9)      (63.1)      (100.0)


TABLE 5.9. DISTRIBUTION OF INFORMANTS ACCORDING TO OCCUPATIONAL
STATUS AND NATURE OF DISEASE CLASSIFICATION

                                      medical                          non-medical
occupational status        complete      no       total      complete      no       total
                           agreement  agreement              agreement  agreement
(1)                           (2)        (3)       (4)          (5)        (6)       (7)
1. low                         17         27        44           19         50        69
                             (38.6)     (61.4)   (100.0)       (27.5)     (72.5)   (100.0)
2. medium                      25         35        60           23         38        61
                             (41.7)     (58.3)   (100.0)       (37.7)     (62.3)   (100.0)
3. high                        25         28        53           20         18        38
                             (47.2)     (52.8)   (100.0)       (52.6)     (47.4)   (100.0)
4. total                       67         90       157           62        106       168
                             (42.7)     (57.3)   (100.0)       (36.9)     (63.1)   (100.0)

5.33. As far as the medical investigators are concerned, there is no evidence of
any association between the accuracy of their returns and the educational or occupational
status of the informants. This suggests that their method of interrogation was practically
independent of what the informant stated about the nature of the disease as he understood
it from the attending physician. However, it can be seen from the above two tables that
the returns of the non-medical investigators were to some extent influenced by the social
status of the respondent, indicating thereby that persons belonging to the higher social
class are more frequently apprised of the nature of the disease by the attending physicians.
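
The report does not say which test, if any, lay behind the statement about association. As a rough illustration only, the counts of Table 5.9 can be put through an ordinary Pearson chi-square statistic for a 3 x 2 table. The function name `chi_square` and the use of the 5 per cent point of chi-square with 2 degrees of freedom (5.991) are assumptions of this sketch, not part of the survey.

```python
# Minimal sketch: Pearson chi-square statistic for a contingency table.
# The choice of test is an assumption; the report names no test.

def chi_square(table):
    """Return the Pearson chi-square statistic for a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Table 5.9: rows are low, medium, high occupational status;
# columns are (complete agreement, no agreement).
medical = [[17, 27], [25, 35], [25, 28]]
non_medical = [[19, 50], [23, 38], [20, 18]]

CRITICAL_5_PERCENT_2_DF = 5.991  # standard tabulated value, 2 degrees of freedom

for label, table in (("medical", medical), ("non-medical", non_medical)):
    stat = chi_square(table)
    verdict = "exceeds" if stat > CRITICAL_5_PERCENT_2_DF else "does not exceed"
    print(f"{label}: chi-square = {stat:.2f} ({verdict} the 5 per cent point)")
```

On these counts the statistic for the medical investigators falls below the 5 per cent point and that for the non-medical investigators lies above it, which is in line with the reading of the tables given in the text.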


5.34. Two important factors emerge from the results of the Validity Survey discussed
in the preceding paragraphs. First, the inaccuracy arising out of misreporting of diseases
is considerable in an investigation of this type and second, the accuracy of the disease
reporting is not appreciably enhanced by the employment of medical investigators for this
purpose. Further, the results shown in Tables 5.4 and 5.5 indicate the directions in which
misreporting of diseases takes place, which may be fruitfully applied in interpreting
morbidity rates in respect of different disease groups. But it is necessary to emphasize
in this context that the above Validity Survey included within its scope only such cases of
diseases as were attended at a hospital. If a health survey is carried out in a population,
it may be observed that a substantial number of diseases occurring in the population do
not receive any medical treatment. It is only reasonable to expect a greater degree of
inaccuracy in the reporting of such diseases, which will only tend to make the situation
worse. Moreover, there are no means of checking the validity of non-attended cases except
by a prevalence survey by trained medical personnel. But such a survey will have to be
necessarily restricted to chronic diseases because they alone can be expected to prevail in
appreciable numbers at the time of investigation.


5.35. Though the results of the Validity Survey discussed above give only a partial
picture of the inaccuracies in the morbidity returns, it is assumed that they may be of
considerable value in the interpretation of the morbidity rates estimated from the data
relating to the West Bengal Health Survey. A brief discussion of these morbidity rates
based on the results of the Validity Survey is attempted in the following paragraphs.


5.36. The West Bengal Health Survey showed a total of 604 cases of illnesses among
the members of 1172 rural households and 351 cases of illnesses among the members of 566
urban households during the three-month period of observation [...] ending in May, 1955.
As could be expected, some [...] to the first reference period and some continued [...]
reference period. In the former case, if the illness [...] nature, only the approximate
date was noted in column 6 of block 7 of the schedule. As regards the latter, not even an
approximate estimation of duration of illnesses could be availed [...]


5.37. The allocation of illnesses into chronic and acute was done on the basis of
their duration. All illnesses whose duration exceeded three months were classified as
chronic and those illnesses which prevailed for periods shorter than 3 months were classified
as acute. The classification of such illnesses which were prevailing [...]
and whose duration fell short of 3 months till that date was carried out with the help of
medical experts.
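
The allocation rule just described can be written down as a small sketch. The function name, the return labels and the 90-day equivalent of three months are assumptions made for illustration; the text does not say how an illness lasting exactly three months was treated, and here it is placed with the acute cases.

```python
# Minimal sketch of the chronic/acute allocation rule of paragraph 5.37.
# THREE_MONTHS_IN_DAYS = 90 is an assumed equivalent of "three months".

THREE_MONTHS_IN_DAYS = 90

def classify_illness(duration_days, still_prevailing=False):
    """Allocate an illness by duration.

    duration_days    -- observed duration up to recovery or up to the date of enquiry
    still_prevailing -- True if the illness had not ended on the date of enquiry
    """
    if duration_days > THREE_MONTHS_IN_DAYS:
        return "chronic"
    if still_prevailing:
        # Duration so far falls short of three months and the illness continues:
        # the survey left these cases to the judgment of medical experts.
        return "refer to medical experts"
    return "acute"

print(classify_illness(200))
print(classify_illness(10))
print(classify_illness(40, still_prevailing=True))
```

The third branch is not a classification in itself; it only marks the cases which, in the survey, were settled with expert help rather than by the duration rule.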


5.38. Since the exact time of onset and recovery of acute diseases are more or less
abrupt and recognisable, it is customary to define the morbidity of the population in
respect of such diseases by the incidence rate, which implies the frequency of new cases
arising in a given interval of time among 1000 population. This gives a more or less
dynamic picture of morbidity and as such the preventive aspects are fully brought to light.

5.39. In the case of chronic diseases [...] and sometimes [...] only at an advanced stage.
It is not possible in such cases to calculate incidence rates and the best that can be done,
under the circumstances, is to define morbidity in terms of their prevalence. The prevalence
rate is defined as the number of cases among 1000 population at a given time. The prevalence
rate is a useful measure of the extent of chronic diseases in a population prevailing at a
given time regardless of the date of onset of the diseases. This measure, no doubt, gives
only a cross-sectional picture of the morbidity of the population and is very much
influenced by the curative aspects such as effectiveness to reduce their [...] and the stage
at which they are diagnosed. The incidence and prevalence rates for acute and chronic
diseases are presented in Tables 5.10 and 5.11 respectively. The reliability of the rates
given in these tables can, however, be assessed by a comparison of similar rates obtained
from two independent sub-samples shown in Tables 01.2 and 01.3 in Appendix 1.
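
The two definitions may be illustrated by a short sketch. The function names are hypothetical. The prevalence example uses the 122 rural and 83 urban chronic cases and the sample sizes of 5,966 and 2,385 persons reported elsewhere in this report; the incidence example uses invented figures, and its simple scaling to a year is an assumption that ignores the seasonal adjustment made in the survey.

```python
# Minimal sketch of the incidence and prevalence rates per 1000 population.

def prevalence_rate_per_1000(cases_at_given_time, population):
    """Cases existing at a given time per 1000 population."""
    return 1000 * cases_at_given_time / population

def incidence_rate_per_1000(new_cases, population, period_in_years=1.0):
    """New cases per 1000 population per year.

    period_in_years scales a shorter observation period to a full year;
    this plain scaling is an assumption and allows for no seasonal pattern.
    """
    return 1000 * new_cases / population / period_in_years

# Chronic cases and sample sizes as reported for the rural and urban samples.
print(round(prevalence_rate_per_1000(122, 5966), 2))
print(round(prevalence_rate_per_1000(83, 2385), 2))

# Hypothetical figures: 50 new cases in three months among 2000 persons.
print(incidence_rate_per_1000(50, 2000, period_in_years=0.25))
```

The two prevalence figures so obtained agree with the totals of Table 5.11.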


TABLE 5.10. INCIDENCE RATES FOR ACUTE DISEASES CLASSIFIED ACCORDING
TO DISEASE GROUPS IN WEST BENGAL

                                                                 incidence rate per 1000
disease group                                                     population in a year
                                                                   rural       urban
(1)                                                                 (2)         (3)
1. group I—malaria                                                 46.28       14.19
2. group II—dysentery                                              26.54       36.26
3. group III—diarrhoea, enteritis and other diseases of the
     digestive system                                              40.15       74.10
4. group IV—other infective diseases of intestinal tract
     e.g. typhoid, cholera, diseases due to helminths, etc.        10.21       23.65
5. group V—measles, mumps, small pox, chicken pox                  25.86       22.07
6. group VI—common cold, tonsilitis, influenza, fever,
     pneumonia, bronchitis and other respiratory
     diseases                                                     140.87      179.72
7. group VII—eye, ear, boil and abscess, cellulitis and
     dental diseases                                               25.18       36.26
8. group VIII—others (e.g. anaemias, v.d., vascular
     lesions affecting central nervous system, rheumatic
     fever, appendicitis, congenital malformations,
     accidents, etc.)                                              12.93       36.26
9. total                                                          328.02      422.51


TABLE 5.11. PREVALENCE RATES FOR CHRONIC DISEASES CLASSIFIED ACCORDING
TO DISEASE GROUPS IN WEST BENGAL

                                                                 prevalence rate per 1000
disease group                                                          population
                                                                   rural       urban
(1)                                                                 (2)         (3)
1. group I—tuberculosis (pulmonary)                                 1.68        3.77
2. group II—diseases of the circulatory and nervous
     systems viz., arteriosclerotic and degenerative
     heart diseases, hypertension, diseases of veins,
     rheumatic fever, psychoneurosis, diseases of nerves            3.69        2.52
3. group III—diseases of the eye, ear, skin, bones and
     joints                                                         4.02        5.03
4. group IV—diseases of the stomach and duodenum
     except cancer                                                  2.68        5.87
5. group V—asthma                                                   3.52        4.61
6. group VI—diseases of the genital organs, fistula                 2.85        4.61
7. group VII—others (e.g., v.d., cancer, diabetes, avitaminosis,
     nephritis, congenital and functional
     diseases, etc.)                                                2.01        8.39
8. total                                                           20.45       34.80


5.40. At the outset of the analysis it was our intention to strictly adhere to the
International Statistical Classification of Diseases and Injuries (List C—Special List of 50
causes for tabulation of morbidity—W.H.O., 1948). Subsequently, it was found from the
nature of the data collected that even such an abridged list was too detailed for obtaining
any reliable morbidity rates. The classification of diseases in the above analysis had,
therefore, to be considerably condensed without appreciably damaging the essential features
of the prevailing morbidity pattern in West Bengal.

5.41. The most revealing feature of the above tables is that the diseases, whether
acute or chronic, occur more often amongst the urban than in the rural residents, the only
exceptions being malaria and diseases of the circulatory and nervous systems. Common
cold, influenza and other diseases of the respiratory system are the most commonly reported
diseases during the reference period both in the rural and urban areas.


5.42. [...] of the factors likely to influence morbidity returns. [...] the age-sex
composition, the level of health consciousness [...] care services in the communities to be
compared. [...] conclusion as to the relative healthiness of two areas without proper [...]
of diseases are appreciably affected by errors due [...] results of the Validity Survey.

5.43. The age-sex composition of the rural and urban samples are shown in Table 5.12.

TABLE 5.12. AGE-SEX DISTRIBUTION OF THE SURVEYED POPULATION

age (in years)         sex             rural               urban
1. less than 1         persons      152   (2.55)¹        53   (2.22)
                       males         78   (1.31)         27   (1.13)
                       females       74   (1.24)         26   (1.09)
2. 1—4                 persons      770  (12.91)        273  (11.45)
                       males        396   (6.64)        143   (6.00)
                       females      374   (6.27)        130   (5.45)
3. 5—14                persons     1478  (24.77)        522  (21.89)
                       males        808  (13.54)        278  (11.66)
                       females      670  (11.23)        244  (10.23)
4. 15 and above        persons     3566  (59.77)       1537  (64.44)
                       males       1804  (30.24)        888  (37.23)
                       females     1762  (29.53)        649  (27.21)
5. all age-groups      persons     5966 (100.00)       2385 (100.00)
                       males       3086  (51.73)       1336  (56.02)
                       females     2880  (48.27)       1049  (43.98)

1 Figures in parentheses are percentages.


5.44. From the above table it appears that the rural and urban samples had more
or less similar age-sex composition from which it follows that the urban-rural differentials
in morbidity could not be ascribable to the difference in the age-sex composition of their
populations.

5.45. A further breakdown of the morbidity rates by age, sex, living conditions,
educational and occupational status would, indeed, be helpful in preventive public health
work. This was not attempted here as the scope of the survey did not allow such a
detailed study.


5.46. Lal and Seal (loc. cit.) have given morbidity rates for certain principal
chronic and acute diseases. It may be of interest to make broad comparisons between the
morbidity rates estimated from the data of West Bengal Health Survey, 1955, and the
Singur Health Survey, 1944. It is well known that some of the acute diseases have a distinct
seasonal pattern. It is, therefore, necessary to allow for this seasonal influence while
estimating the total number of cases that may be expected during the whole year. No
such adjustment for seasonal fluctuations need be made in respect of the Singur Health
Survey data, because the information collected relates to one year.

5.47. The morbidity rates of certain important diseases estimated from the rural
data collected in this survey have been compared with the corresponding rates obtained
from the Singur Health Survey in Table 5.13. To make the comparison strictly valid for
such diseases as have a seasonal pattern appropriate adjustments have been made. It
could be observed that there is a fair degree of agreement between the two sets of figures.
The incidence rate for malaria estimated from the present survey, even after accounting
for seasonal influence, is strikingly low in comparison to the one obtained by Lal and Seal
for Singur. It was stated earlier while discussing the Validity Survey results that malaria
showed a tendency to be misreported as other diseases and that diseases other than
malaria were less likely to be returned as malaria. The above tendency was observed to
be operating to the same extent among the medical and non-medical investigating teams.
It is, therefore, reasonable to assume that the difference in the types of investigators
employed in the two surveys could not have resulted in a divergence of the magnitude
shown in Table 5.13. Hence, it may be reasonably assumed that the difference between
the two malarial rates observed is a real one. This is only natural because Singur during
the forties was a highly malarial place, though today malaria has been practically controlled
there. Moreover, there is a gap of about a decade between the two surveys during which
time a reduction in malaria incidence in West Bengal could have taken place due to better
health measures. As regards the incidence of measles, the Singur rate appears to be
higher than the rate in this survey. The higher rate for measles observed in the Singur
population might probably have been due to its high density of population and its proximity
to such congested areas as Howrah and Calcutta.


5.48. It has been pointed out earlier that the performance of the medical investigators
was more satisfactory than that of the non-medical investigators in respect of respiratory
diseases considered as a whole. The rate obtained from the reports of the latter
was usually an under-estimate as diseases belonging to this group were more often returned
as diseases belonging to other groups. It is, therefore, not unlikely that the rate estimated
from the data relating to the West Bengal Health Survey for pneumonia and influenza
falls short of the corresponding rate estimated from the Singur Health Survey data as the
latter was based on medical investigators' reports. However, the difference in the rates
shown in Table 5.13 is not statistically significant.


TABLE 5.13. COMPARISON OF THE RESULTS OF THE WEST BENGAL HEALTH
SURVEY (RURAL) AND THE SINGUR HEALTH SURVEY

                                   annual morbidity rate per 1000 population

                                                    West Bengal Health Survey, 1955
disease                        Singur Health        before              after
                               Survey, 1944      adjustment for     adjustment for
                                                 seasonal pattern   seasonal pattern
(1)                                (2)                (3)                (4)


acute disease 
1. malaria 


46 154 

2. dysentery and diarrhoea 38 56 ea 
3. measles 42 5 n 
4. pneumonia and influenza 12 6 М 
chronic disease
5. tuberculosis (pulmonary)        1.09    1.68
6. asthma                          2.82    3.52
7. diseases of the circulatory and
     nervous system                3.44    3.69

5.49. With regard to certain chronic diseases also it was noticed that the prevalence
rates estimated from the West Bengal Health Survey data were in close correspondence
with those estimated from the data of the Singur Health Survey.
5.50. As before, it is necessary to interpret the prevalence rates obtained from the
two surveys in the light of the results of the Validity Survey. It was found that the medical
investigators misreported all the tuberculosis cases as belonging to some other diseases.
The non-medical investigators also misreported a substantial number of tuberculosis
(pulmonary) cases. But they exhibited a tendency to include a number of cases of other
diseases in the tuberculosis (pulmonary) group leading to an exaggerated prevalence rate
for this group. Whatever may be the direction in which misclassification of tuberculosis
(pulmonary) cases tends to occur, it seems that the only method of assessing accurately the
prevalence of pulmonary tuberculosis is by complete physical examination of the population
surveyed.

5.51. In respect of allergic diseases like asthma, etc., the medical investigators’
performance was found to be far superior to that of the non-medical investigators. As
regards diseases of the nervous and circulatory systems the extent of agreement seemed
to be almost the same for both the groups of investigators. But the rates based on the
non-medical investigators are likely to be exaggerated on account of including in each of the
above groups, diseases belonging to other groups.

5.52. In diseases of a chronic nature such as T.B. where there is no abrupt onset
of a diseased condition in the affected individuals the estimated morbidity rates should
be taken as corresponding to clinically diagnosed diseases or those causing severe disability.
Moreover, there are other reasons which tend to under-estimate the morbidity rates of such
diseases. For instance, there is a certain amount of time-lag between the actual onset of
a disease and the time when medical diagnosis is sought. The degree of disability or
discomfort arising out of an affliction and the level of health consciousness of the subjects
are among the important factors which largely determine the stage at which the disease is
subjected to a medical diagnosis. It is, therefore, inevitable that some of the cases go
unaccounted due to the operation of these and similar factors. In some acute diseases
the subjects may fail to recognise the symptoms manifested by these diseases due to their
ignorance and low health consciousness. Hence, interpretation of morbidity rates has
to be based on a proper appreciation of the factors involved. In the report of the Sickness
Survey conducted in U.K. by the Ministry of Health (1946), it was observed that out of
a sample of about 7,000 population, 5,518 or 79% suffered from one or more illnesses or
injuries during a three-month period. Pearse and Crocker (1944) in their study ‘The
Peckham Experiment—a study of the living structure of society’ have also arrived at
similar results. They estimated that only about 10 per cent of the population on which
a health overhaul was done was without any sign of disorder and the remaining 90 per
cent were either in disease or in whom disorder was associated with a sense of well-being.
As against this, the estimated number of cases of illnesses or injuries during the three-month
period in rural West Bengal was 604 out of a sample of 5,966 persons i.e., 10 per
cent and in urban West Bengal the corresponding number was 351 out of a sample of
2,385 persons i.e., 15 per cent.

5.53. The contrast between the estimates of West Bengal and U.K. is indeed
striking. The results suggest that the people of U.K. are less healthy than those of West
Bengal which contradicts the prevailing notion about the relative levels of health of the
two populations. These findings have to be interpreted in the light of the health consciousness
of the subjects which is essentially a concomitant of their levels of living. As there
is no well defined line of demarcation between the state of health and that of disease of an
individual it is likely that the morbidity returns obtained from an investigation of this type
are influenced appreciably by the level of health consciousness of the community. In
our country where the degree of health consciousness is known to be low, there is a natural
tendency to overlook minor ailments and report only such conditions which cause perm[...]
discomfort, or disability to the subjects. Hence, a substantial number of illnesses might
not have been reported at all. As the morbidity data collected by means of interrogation
of the individuals are affected by a considerable degree of subjectivity, the only means of
assessing the extent of morbidity seems to be a prevalence survey carried out on the basis
of a complete physical examination supplemented by laboratory tests, if necessary.
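
The percentages compared in paragraphs 5.52 and 5.53 follow directly from the counts quoted there, as the sketch below shows. The helper name `per_cent` is hypothetical. It should be borne in mind that the West Bengal figures count cases of illness per 100 persons while the U.K. figure counts persons who were ill, so the two are only roughly comparable.

```python
# Minimal sketch: the three-month percentages quoted in paragraph 5.52.

def per_cent(numerator, denominator):
    """Numerator as a percentage of denominator."""
    return 100 * numerator / denominator

figures = {
    "rural West Bengal (cases per 100 persons)": (604, 5966),
    "urban West Bengal (cases per 100 persons)": (351, 2385),
    "U.K. Sickness Survey (persons ill per 100)": (5518, 7000),
}

for label, (count, sample) in figures.items():
    print(f"{label}: {per_cent(count, sample):.1f}")
```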


CHAPTER 6 


DISABILITY 


6.1. In the preceding paragraphs, discussion was chiefly confined to the frequency 
of incidence and prevalence of diseases among the rural and urban populations of West 
Bengal and their classification by causes. In what follows, an attempt is made to describe 
briefly the question of disability arising out of these diseases and their social consequences. 

6.2. Though it is desirable to split the duration of disabling illnesses into days of
disability and non-disability, it was not possible to do so with the data in hand. Hence,
what is referred to as days of disability hereinafter is actually the duration of disabling
illnesses.


6.3. As stated earlier, illnesses, both chronic and acute, were divided into three
classes according to the nature of disability caused by them, namely, (i) non-disabling,
(ii) disabling but not causing confinement to bed or hospital and (iii) causing confinement
to bed or hospital.


6.4. The illnesses which did not cause confinement to bed or hospital were classified
as disabling (case (ii) above), if the illnesses led to either stoppage of usual work or
availing of medical care or special diet, [...] particular assignment of work in or outside the
household. Hence, for a critical evaluation of disability and its consequences, the discussion
is limited to persons in the age-group 15-59 years. Moreover, since most of the
persons belonging to this age-group are either in the labour force or engaged in domestic
duties, the disability arising in this segment of the population may have inevitable economic
and social consequences.


6.5. The total number of disabling illnesses and their proportion to total illness
occurring to the rural and urban populations in the age-group 15-59 years are shown in
Table 6.1. The reliability of these figures can be assessed by comparing the
sub-sample estimates shown in Table 01.4 of Appendix 1.

TABLE 6.1. ILLNESSES OCCURRING DURING THE REFERENCE PERIOD
CLASSIFIED INTO TYPE OF DISABILITY IN THE AGE GROUP
15-59 YEARS

sector       non-disabling    disabling     total       percentage of
                illness        illness     illnesses    non-disabling
                                                          illnesses
(1)               (2)            (3)          (4)            (5)
1. rural          115            217          332           34.64
2. urban           49            135          184           26.63
3. total          164            352          516           31.78

6.6. The proportions of non-disabling illnesses to total illnesses are about 35 per
cent and 27 per cent for the rural and urban populations respectively. In the survey of
sickness of the population of U.K. (loc. cit.) it was observed that out of 4667 cases of
independent illnesses occurring to the adult population in the sample, 4237 or 91 per cent
were of a non-disabling nature or had duration of disability for less than a day. In a
morbidity study carried out in the Eastern Health District of Baltimore during 1938-43,
it was observed that 53 per cent of total cases of all ages were non-disabling. A comparison
of the West Bengal Survey results with those pertaining to the U.K. or Baltimore clearly
indicates that a number of non-disabling illnesses have not been reported in the West Bengal
Survey, a substantial fraction of which, it may not be unreasonable to attribute to the
low level of health-consciousness of the people. If by some method this unknown number
can be estimated and added to the already reported non-disabling cases, one could have
had an approximate estimate of the number of people in an indifferent state of health, who
could not pull their full weight in the economic activities in which they are engaged.

6.7. The results presented in Table 6.1 need not necessarily reflect the real extent
of the social and economic implications of disability to the community. A better measure
of this may be the duration of disability due to various causes and their frequency
of occurrence indicating the extent of human wastage which otherwise could have been
utilised for productive purposes. For this purpose, the duration of disability due to each
kind of illness falling strictly within the reference period was cumulated over the four
reference periods and inflated four times to yield annual estimate of number of days lost
due to disability arising from each type of disease. No attempt was, however, made to
adjust for the seasonal peculiarity of the survey period. In Table 6.2 are shown the total
days of disability in a year for the age-group 15-59 years in the surveyed population.
Similar results are given for the two sub-samples in Table 01.5 in Appendix 1.
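
The inflation described in paragraph 6.7 amounts to the arithmetic sketched below. The function names are hypothetical, the split of days between the four reference periods is invented, and the figure of 3,289 persons aged 15-59 years is an assumption chosen for illustration, since the size of that age-group is not stated in this section.

```python
# Minimal sketch of the annual disability estimate of paragraph 6.7.

def annual_disability_days(days_by_reference_period, inflation_factor=4):
    """Cumulate the days observed in each reference period and inflate
    the three-month total to a year (no seasonal adjustment)."""
    return inflation_factor * sum(days_by_reference_period)

def disability_days_per_person(annual_days, persons):
    """Annual disability days divided by the number of persons at risk."""
    return annual_days / persons

# Invented days of disability in the four reference periods for one disease.
observed = [60, 70, 80, 67]
annual = annual_disability_days(observed)
print(annual)

# Assumed number of persons aged 15-59 years in the sample (not from the text).
print(round(disability_days_per_person(annual, 3289), 2))
```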


6.8. Of the acute diseases, malaria, dysentery, diarrhoea, enteritis and other
diseases of the digestive system, diseases of the respiratory system and boil, abscess,
cellulitis etc., are the principal diseases causing disability in both the rural and urban areas.
As could be expected malaria accounts for a higher annual rate of disability in the rural
areas than in the urban areas. Similarly, diseases of the digestive system (other than
diarrhoea and enteritis) and boil, abscess, cellulitis etc., are associated with higher annual
disability rates amongst rural persons. On the other hand, the rate of disability due to
diarrhoea and enteritis is higher amongst the urban population. The rate of disability
due to dysentery, however, does not show any sharp differential between the rural and
urban groups.

TABLE 6.2. TOTAL DISABILITY DAYS IN A YEAR AND DISABILITY DAYS PER
PERSON IN A YEAR IN THE SURVEYED POPULATION AGED 15-59 YEARS

                                              rural                              urban
disability due to                 total disability  disability days  total disability  disability days
                                  days in a year    per person       days in a year    per person
                                                    in a year                          in a year
(1)                                     (2)              (3)              (4)              (5)
acute diseases :
  (i) malaria                          1,108            0.34              372             0.26
  (ii) dysentery                         572            0.17              293             0.20
  (iii) diarrhoea and enteritis          246            0.07              409             0.28
  (iv) other acute diseases of
       digestive system                  738            0.22              139             0.10
  (v) acute diseases of respiratory
       system including fever          2,448            0.75            1,267             0.88
  (vi) boil, abscess, cellulitis and
       other skin infections           3,183            0.97              692             0.48
  (vii) other acute diseases           1,969            0.60            1,669             1.16
all acute diseases                    10,264            3.12            4,841             3.36
all chronic diseases                  14,965            4.55           14,600            10.10
all diseases                          25,229            7.67           19,441            13.46

6.9. The overall annual rate of disability in terms of days lost due to either acute
or chronic diseases is more for the urban sector than for the rural sector. Comparing the
estimate arrived at for the urban sector with those obtained for the Sickness Survey
in U.K. (16.8 days per adult annually) [...] District of Baltimore (15.9 days per [...]
figure seems to be [...] of disability due to acute diseases [...] For example, an urban
adult on an average loses nearly twice the number of days on account of chronic diseases
as compared with a rural adult. Thus, [...]


CHAPTER 7


MEDICAL AND MATERNITY CARE 


7.1. Medical care. The frequency with which diseases occur and their duration
and the nature of disability in a community are no doubt of great value to the public
health administrator. But they describe only one side of the health picture of the population.
On the other hand, the amount of medical care available to the population roughly
indicates whether such facilities are sufficient to cope with the morbidity situation. Secondly,
the extent to which they are utilized by the afflicted individuals suggests how the [...]
of recovery is affected. Thirdly, knowledge as to who are the actual beneficiaries of the
existing medical set-up will considerably help in planning the distribution of medical benefits
to the population.

7.2. Since the achievement of independence of India, the importance and urgency
of providing adequate medical care in its curative and preventive aspects are increasingly
realised. The Health Survey and Development Committee (loc. cit.) rightly points out that
‘a nation's health is perhaps the most potent single factor in determining the character
and extent of its development and progress and any expenditure of money and effort on
improving the national health is a gilt-edged investment yielding immediate and steady
returns in increased productive capacity. . . . The provision of adequate health protection
to all covering both its curative and preventive aspects, irrespective of their ability to pay
for it, . . . are all facets of a single problem and call for urgent attention.’

7.3. The situation as it exists today is far from satisfactory. The high incidence
of preventable diseases and the heavy toll of life taken by these diseases, the abnormal infant
and maternal mortality, the widespread existence of malnutrition and under-nutrition,
deplorable housing conditions, and grossly inadequate preventive and curative health
services are important features of the present health picture of the population. A comparison
of the existing medical facilities in our country with those available in the more progressive
countries like the U.K. or the U.S.A. (Table 7.1) will reveal how inadequate the available
facilities are.

TABLE 7.1. MEDICAL PERSONNEL AND HOSPITAL FACILITIES IN
INDIA, U.S.A. AND U.K.

                                      inhabitants per
country    year    physician    midwife    pharmacist    hospital bed
(1)        (2)        (3)         (4)         (5)            (6)
U.S.A.     1953        750        400¹       1,600           100
U.K.       1951      1,150       4,550       3,500           110
India      1952      5,700      23,000      18,700           250

1 Refers to graduate nurses in 1954.

7.4. It is clear from the above table that India lags far behind the U.S.A. and the
U.K. as regards the availability of medical personnel and hospital beds. The worst sufferers
are the rural population of India. They form about 80 per cent of the total population
but hardly 30 per cent of the doctors are available to them. The situation is equally bad
with respect to hospital beds and dispensaries. Further, as the rural population is widely
dispersed with no adequate transport facilities, accessibility to the medical personnel,
hospitals or dispensaries is considerably restricted.


7.5. As an auxiliary to the present health survey, an investigation into the availability
of medical facilities in rural areas was conducted in 69 villages selected for the survey.
The results of the investigation shown in Table 7.2 help to evaluate approximately the
extent of medical care available in rural West Bengal at present. A comparison with official
figures for all West Bengal will reveal the rural-urban differential in respect of the
availability of medical personnel.


TABLE 7.2. THE AVAILABILITY OF REGISTERED MEDICAL PRACTITIONERS IN
THE SURVEYED RURAL POPULATION AND IN WEST BENGAL

                     no. of     total        no. of regd. doctors    inhabitants per regd. doctor
population           villages   population   allopath    all         allopath   all        all West
                     in         of sample                systems                systems    Bengal, 1951
                     sample     villages                                                   (allopath)¹
(1)                    (2)        (3)           (4)        (5)          (6)       (7)         (8)
1. less than 1000       41       19,827           2          2         9,414     9,414
2. 1000—1999            14       18,707           4          6         4,677     3,118
3. 2000 and above       14       61,658          19         21         3,245     2,936
4. total                69      100,192          25         29         4,008     3,455       1,318

1 Statistical Abstract of West Bengal—1952.
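
The figures of inhabitants per registered doctor in Table 7.2 are simple ratios of population to doctors, rounded to whole numbers. A minimal sketch follows, with a hypothetical function name, using the rows for villages of 1000 and above and the total row of the table.

```python
# Minimal sketch: inhabitants per registered doctor, as in Table 7.2.

def inhabitants_per_doctor(population, doctors):
    """Population divided by the number of doctors, rounded to a whole number."""
    return round(population / doctors)

# (size class, total population, allopath doctors, doctors of all systems)
rows = [
    ("1000-1999", 18707, 4, 6),
    ("2000 and above", 61658, 19, 21),
    ("total", 100192, 25, 29),
]

for size_class, population, allopath, all_systems in rows:
    print(
        size_class,
        inhabitants_per_doctor(population, allopath),
        inhabitants_per_doctor(population, all_systems),
    )
```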


7.6. The usual way of presenting [...] population as in Table 7.1 cannot be
considered as a [...] these people, a descrip[...] give a true picture of medical care
at [...] population who are the vulnerable groups from the point of view of morbidity, might
not be able to avail of any sort of medical treatment for pecuniary reasons. Therefore, to
obtain a true picture of medical care pattern in a community, it is essential to have,
besides an assessment of such facilities available to the community, an assessment of the
extent of facilities actually availed by the community.


7.7. During the course of the three-month observational period information on
482 illnesses of an acute nature and 122 illnesses of a chronic nature was gathered from
the canvassed rural households. Similarly, during the same period, data on 268 acute
illnesses and 83 chronic illnesses were collected from the canvassed urban households. The
proportion of illnesses receiving medical treatment and the type of such treatment are shown
in Table 7.3. The reliability of these estimates can, however, be assessed by comparing the
similar estimates obtained from the two sub-samples shown in Table 01.6 of Appendix 1.


TABLE 7.3. PERCENTAGE DISTRIBUTION OF DISEASES OR INJURIES ACCORDING
TO TYPE OF TREATMENT RECEIVED

                                     rural                  urban
type of treatment               acute     chronic      acute     chronic
(1)                             (2)       (3)          (4)       (5)
1. allopath                      39.42     46.72        42.54     79.52
2. homeopath                     16.18     11.48        23.88     12.05
3. ayurved or unani               6.22      9.84         3.36      7.23
4. quack or no treatment         39.83     40.98        33.58     25.30
5. total                        101.65¹   109.02       103.36    124.10
                                 (482)²    (122)        (268)      (83)

¹ Percentages will add up to more than 100, as some cases received more than
one type of treatment.
² Figures in parentheses are the numbers of cases reported during reference
period.

7.8. It is found that about 41 and 51 per cent of the cases in the rural and urban
areas respectively availed allopathic treatment whereas only about 7 and 4 per cent of the
cases took recourse to the Indian system of medicine. About 15 per cent of the rural cases
and 21 per cent of the urban cases availed homeopathic treatment. All these suggest that
the allopathic system of treatment is more commonly availed by the population even in the
rural areas. Between the homeopathic and Indian systems of medicine, the former is the
more popular one judging from the proportion of cases treated. Of course, the popularity
of any system of treatment is the combined effect of efficiency, cost and availability.

7.9. Another fact brought out clearly by the above table is regarding the proportion
of illnesses medically attended. It is found that only about 40 per cent of the rural cases
and about 32 per cent of the urban cases did not avail treatment from any recognised
medical system. This is really surprising in view of the fact that even in such advanced
countries like the U.K. or Canada where health services have reached a high level of
development, the proportion of cases not seeking medical care is much higher. For instance,
in the sickness survey done in the U.K. (loc. cit.) it was observed that about 60 per cent
of the cases did not avail of any kind of treatment. In the Canadian Sickness Survey,
1950-51, it was estimated that out of a total of 29,471 complaint periods 21,134 or about
72 per cent received no health care. In contrast to these estimates, the West Bengal
Health Survey has shown a very low figure for the proportion of cases not availing any
medical treatment. Considering that in West Bengal as in other parts of India, there is
a paucity of medical personnel, hospital beds and other treatment facilities compared to
those medically advanced Western countries, it is somewhat difficult to reconcile the observed
result. The possible explanation for this, as has been stated earlier, may be found in the
tendency to omit minor illnesses or injuries causing little or no disability and for which
perhaps no medical care was sought. If by some means it is possible to estimate such
omissions, the proportion not availing any medical care is bound to go up.


7.10. The same situation could be viewed from another angle to ascertain the
real extent of medical care availed. It is not unreasonable to assume that the morbidity
rate in West Bengal is higher than that of the U.K. or the U.S.A. and that almost all
cases medically treated in West Bengal are generally reported. Under the circumstances,
the ratio of treated cases to the total population will furnish a better appraisal of the extent
of medical care availed. In West Bengal it was found that 650 cases out of a total of
955 cases occurring in a period of 3 months to 8351 persons comprising the rural and urban
populations, received some kind of medical attention. In other words, 7.8 per cent of the
population could avail of medical care. The corresponding figure as revealed in the Sickness
Survey in U.K. (loc. cit.) is about 31 per cent and that of the Canadian Sickness Survey,
1950-51 (loc. cit.) is about 53 per cent. That is suggestive of the fact that the number
actually seeking medical advice is significantly low in spite of indications to the contrary
that about two-thirds of the cases are medically attended.
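
The two ways of looking at the same 650 treated cases in paragraph 7.10 can be set side by side with a few lines of arithmetic. This is only an illustrative sketch; the function name is hypothetical and the figures are those quoted in the paragraph.

```python
def per_cent(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


treated_cases = 650      # cases receiving some kind of medical attention
reported_cases = 955     # all cases reported in the three months
population = 8351        # persons in the rural and urban samples together

# Treated cases relative to the population exposed to risk.
print(f"{per_cent(treated_cases, population):.1f} per cent of the population")

# Treated cases relative to the cases reported (about two-thirds).
print(f"{per_cent(treated_cases, reported_cases):.1f} per cent of reported cases")
```

The first ratio depends only on treated cases being fully reported, whereas the second rises as minor untreated illnesses are omitted from the returns.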


7.11. It would have been useful to analyse the data used in Table 7.3 by further
breakdowns for disease groups, occupations etc., for a proper appreciation of the medical
care pattern availed by the community. The scope of the available data, however, restricts
an analysis of this nature.

7.12. Those who did not avail of any sort of medical treatment during their illness
were further asked to state the reason(s) for not doing so. In both the rural and urban
groups about 41 per cent of such persons attributed it to sickness being ‘not serious’.
About 33 per cent of the unattended rural cases stated that ‘medical care was too expensive’
whereas the comparable figure for the urban group was only 6 per cent. It is natural that
the abject poverty of the rural population only tends to make medical care too expensive.

7.13. Another important aspect of medical care is regarding the expenses incurred
on medical treatment. Here again, a detailed analysis showing the average expenditure
incurred with respect at least to the more commonly occurring diseases will be really useful.
But for reasons stated earlier such an analysis is not attempted.


7.14. The cost of medical care in terms of expenditure incurred is higher for an
urban case than for a rural case. This is evident from Table 7.4 which gives the expenditure
incurred per case during the observational period of three months for different types of
treatment. Some idea of the reliability of the estimates can be had by comparing the two
sub-sample estimates given in Table 01.7 of Appendix 1.

7.15. The higher cost of medical care in to 


. Y become advanced in stage to migrate 
to urban areas in quest of better treatment. Obviously, a higher expenditure is involved 
in the treatment of such cases. 


7.16. It is probably reasonable to bear in mind while interpreting these figures
that they are likely to be exaggerated because the usual tendency is to include not only
the actual expenses incurred during the period under review but also expenditure relating
to some previous period cleared off during the reference period.
TABLE 7.4. MEDICAL EXPENDITURE (IN RS.) INCURRED PER CASE DURING
THE REFERENCE PERIOD ACCORDING TO NATURE OF ILLNESS AND
TYPE OF TREATMENT AVAILED

                                     rural                  urban
type of treatment               acute     chronic      acute     chronic
(1)                             (2)       (3)          (4)       (5)
1. allopath                       8.87     36.89        28.96     75.90
                                 (190)¹     (57)        (114)      (66)
2. homeopath                      4.65     15.79        13.45     45.38
                                  (78)      (14)         (64)      (10)
3. ayurved or unani               5.30     33.23        80.11     37.95
                                  (30)      (12)          (9)       (6)
4. quack or no treatment          1.51      2.79         7.28      7.39
                                 (192)      (50)         (90)      (21)
5. total                          5.18     23.46        20.66     70.43
                                 (482)²    (122)        (268)      (83)

¹ Figures in parentheses are the numbers of cases on which the estimates
are based.
² Totals will not tally as some cases received more than one type of treatment.
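
The total row of Table 7.4 appears consistent with dividing the expenditure summed over all types of treatment by the number of distinct cases, rather than by the sum of the bracketed counts (which exceeds the number of cases because some cases received more than one type of treatment). The sketch below checks this reading for the two rural columns; it is an assumption about how the totals were formed, not a statement of the authors' procedure, and the function name is hypothetical.

```python
def expenditure_per_case(means, counts, distinct_cases):
    """Total expenditure over all treatment types divided by the distinct cases."""
    total_expenditure = sum(mean * n for mean, n in zip(means, counts))
    return total_expenditure / distinct_cases


# Rural acute column of Table 7.4: mean expenditure (Rs.) and cases by treatment type.
rural_acute = expenditure_per_case(
    means=[8.87, 4.65, 5.30, 1.51],
    counts=[190, 78, 30, 192],
    distinct_cases=482,
)

# Rural chronic column of Table 7.4.
rural_chronic = expenditure_per_case(
    means=[36.89, 15.79, 33.23, 2.79],
    counts=[57, 14, 12, 50],
    distinct_cases=122,
)

print(f"{rural_acute:.2f}", f"{rural_chronic:.2f}")
```

Both results agree with the printed totals of Rs. 5.18 and Rs. 23.46 to the second decimal place.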

7.17. Maternity care. Considerable attention is being paid in recent years to
the promotion and protection of the health of the mother and child. Comprehensive
schemes have been launched for the training of maternal and child health personnel like
dhais, midwives, health visitors, nurses etc. in appreciable numbers in order to raise the
existing maternity services to a satisfactory level in a short time. The Second Five Year
Plan envisages the establishment of numerous health centres to look after the interests of
the mother and child.

7.18. Data have been collected in this survey to assess the extent and type of
maternity services availed by the population and the results are briefly summarised in the
following paragraphs.

7.19. Those pregnancies terminating during the three-month observational period
were referred to as current terminations in block 8 of the schedule. Only for such
terminations has detailed information regarding maternity care received been collected. This
restriction had to be imposed because such detailed information could not be elicited if the
events related to the distant past. However, by the above restriction the sample became
extremely inadequate to yield any reliable estimates. For such studies, therefore, a special
survey has to be carried out including only those households in which births are known to
have occurred.

7.20. Of the 45 births taking place during the period under review, 25 births took
place to rural mothers and 20 births to urban mothers. The inadequacy of trained
professional assistance available at deliveries, particularly in rural areas, is clearly borne out
by Table 7.5 which gives the distribution of deliveries according to the agencies attending them.

TABLE 7.5. PERCENTAGE DISTRIBUTION OF CURRENT TERMINATIONS
ACCORDING TO TYPE OF ATTENDANCE

                         midwife                             relatives
sector       doctor      or nurse       dhai     hospital    and          total¹
                         (qualified)                         friends
(1)          (2)         (3)            (4)      (5)         (6)          (7)
1. rural      8.0         0.0           76.0      0.0        24.0         108.0
2. urban     25.0        10.0           25.0     35.0        20.0         115.0

¹ Percentages will add up to more than 100 as some deliveries received more than one
type of attendance.


7.21. As may be seen from the above table, dhais attended about three-fourths
of rural deliveries and one-fourth of urban deliveries. All the rural deliveries were
non-institutional, whereas 35 per cent of urban births took place in hospitals. Professional service
of doctors, and qualified midwives or nurses was availed only in 8 per cent of rural cases.
The corresponding figure for the urban cases was 35 per cent. No professional assistance was
called for in 24 per cent of the rural cases and 20 per cent of urban cases. Though these
estimates are based on small numbers there is no gainsaying that rural populations have to
rely largely on primitive and untrained agencies for this purpose.

7.22. It was observed that as high as 96 per cent of the rural births and 85 per cent
of the urban births were delivered within the same district.

7.23. On an average, medical expenses, which may include payment for the services
of a doctor, hospital, midwife, nurse, or dhai or cost of medicine, were about Rs. 9 in the
case of a rural birth, and Rs. 22 in the case of an urban birth. The higher average cost
incurred in towns or cities is natural because the services available there, being of a superior
nature, are more expensive.

7.24. The average periods of confine: 
and 18 days for a rural mother and 


that she received less effective post-n: 


CHAPTER 8 


RECOMMENDATIONS 


given due consideration while planning
similar studies in future.


8.2. In a general health survey the main emphasis obviously is on the collection
of information on the frequency of the incidence or prevalence of illnesses (or injuries) and
classification by either individual causes or groups of causes
EN ^. аа pattern, two conditions require t ca Tee ee 
m oF the ene qs and second, correctness of the classification by gie 5 f. 
nrbe pes ebd have indicated the possibility of some illnesses Беш i 
for some ee in ins ance, about 9 pon cent of the cases who had sought hos "i i 
А а ad been missed by the investigators. If this could be attrib Errem 
failure of memory of the respondent to recall the event, then it could reaso E E Ч: 
thst in а health survey of the general population in which a see b Td des о 
individuals go without any sort of medical attendance, illnesses are bis E С 0. 
a greater extent. Tf the households are contacted only once, some vau d | rae 2 
bê lost unless the period of reference is of a short duration. This would ined uh T 
insufficient coverage Over time. This difficulty can be got over by planning e KA 
in such a manner as to make it possible to visit each selected household a number of times
at reasonably short intervals.


8.3. It is known that the incidence of certain diseases exhibits a well-marked seasonal
pattern. It is, therefore, desirable to spread the survey period over a whole year. But
a long period of observation would necessarily entail an inconveniently large number of visits
to the same households which might perhaps create practical difficulties. To avoid this
it is desirable to divide the year into 4 typical seasons of 3 months each. The total sample
of households may also likewise be split up into 4 sub-samples and each sub-sample allotted
to each season.

8.4. This study has indicated that the value of a health survey by the usual
questionnaire method to assess the extent of morbidity with respect to diseases like pulmonary
tuberculosis is of a questionable nature. Reliance, therefore, has to be placed on prevalence
surveys making use of diagnostic facilities including laboratory tests for a proper evaluation
of the prevalence of such diseases.

8.5. That a good deal of misreporting of diseases occurs is evident from the results
of the Validity Survey. Hence, in order to make a reasonably correct interpretation of the
results of morbidity returns, it is suggested that a validity survey to assess the extent and
direction of misclassification of diseases be simultaneously attempted.

8.6. In an investigation of this kind the personal error of the investigators is
nonetheless a major factor in determining the reliability of the results. An internal check of the
sample which takes account of not only the sampling variance, but also the personal error
of the investigators is, therefore, necessary to establish the reliability of the final estimates.
This is easily provided by dividing the entire sample into a series of interpenetrating
sub-samples.
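
The idea of interpenetrating sub-samples in paragraph 8.6 can be illustrated with a small sketch: the selected households are divided at random into independent sub-samples, each of which would in practice be canvassed by a different investigator, and the sub-sample estimates are then compared. The household data, the function names and the use of a seeded shuffle are all hypothetical choices made for illustration.

```python
import random


def split_into_subsamples(units, k, seed=0):
    """Randomly allocate the sampled units to k interpenetrating sub-samples."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    # Taking every k-th unit of the shuffled list gives k sub-samples of near-equal size.
    return [shuffled[i::k] for i in range(k)]


def rate_per_1000(households):
    """Illness cases per 1000 persons for a list of (persons, cases) households."""
    persons = sum(p for p, _ in households)
    cases = sum(c for _, c in households)
    return 1000 * cases / persons


# Hypothetical households: (number of persons, number of illness cases).
households = [(5, 1), (4, 0), (6, 2), (3, 0), (7, 1), (5, 0), (4, 1), (6, 1)]

sub_1, sub_2 = split_into_subsamples(households, k=2)
print(rate_per_1000(sub_1), rate_per_1000(sub_2), rate_per_1000(households))
```

A wide gap between the two sub-sample rates, relative to what sampling alone would produce, points to differences between investigators; this is the comparison made in the tables of Appendix 1.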


8.7. Socio-economic factors like education and occupation have been found to be
useful criteria for stratifying urban populations. However, they have their limitations
when applied to rural populations. For a social stratification of rural households, therefore,
it may be desirable to take into consideration such factors as size of holdings or other suitable
economic characteristics closely related to the actual living levels of the households.

8.8. This pilot survey comprises about 1200 rural households selected from about
3.8 million households in rural West Bengal, i.e. one in every 3,000 households. As far
as gross morbidity rates were concerned, this sample was found to be adequate to give fairly
precise estimates. But for a detailed analysis with finer breakdowns, the sample size proved
to be inadequate.
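
The sampling fraction quoted in paragraph 8.8 can be checked directly from the two round figures given there. The sketch below is only that arithmetic; the variable names are hypothetical.

```python
sample_households = 1200            # rural households in the pilot survey (approximate)
population_households = 3_800_000   # households in rural West Bengal (approximate)

# Number of households in the population for each household in the sample.
households_per_sampled = population_households / sample_households
sampling_fraction = sample_households / population_households

print(round(households_per_sampled))   # roughly one household in every 3,000
print(f"{sampling_fraction:.5f}")
```

With these round inputs the ratio comes to about one in 3,170, which the text states as one in every 3,000.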


8.9. If a nation-wide morbidity survey is attempted and the same sampling fraction 
as above is maintained, then it would be possible to obtain precise estimates of morbidity 
rates by finer breakdowns at the all-India level, and at the same time State estimates of gross 
morbidity rates could be obtained with fair degree of precision. 


8.10. The urban sampling procedure requires modification to suit the special features
of a general health survey to obtain a more economical design. However, it may not be
possible to suggest an adequate sample size for the urban population until some more pilot
surveys are conducted.


8.11. General health surveys are not expected to provide adequate data for a 
detailed study of various aspects relating to maternity unless the sample is made unduly 
large. For such studies it is desirable that the sampling frame consists of households where 
births are known to have occurred during a recent specified period. 
APPENDIX 1

COMPARISON OF TWO INDEPENDENT SUB-SAMPLE ESTIMATES

TABLE 01.1. INFANT MORTALITY RATE PER 1000 LIVE BIRTHS FOR TWO
SUB-SAMPLES

               sub-sample 1             sub-sample 2             combined
sector         live       mortality     live       mortality     live       mortality
               births     rate          births     rate          births     rate
(1)            (2)        (3)           (4)        (5)           (6)        (7)
1. rural       2,909      173.60        2,630      164.26        5,539      169.16
2. urban         911      155.87          934      116.70        1,845      136.04

TABLE 01.2. INCIDENCE RATES FOR ACUTE DISEASES CLASSIFIED ACCORDING
TO DISEASE GROUPS FOR TWO SUB-SAMPLES

                                              incidence rate per 1000 population in a year
                                              rural                               urban
disease group                                 sub-sample  sub-sample  combined    sub-sample    sub-sample    combined
                                              1           2                       1             2
(1)                                           (2)         (3)         (4)         (5)           (6)           (7)
group I—malaria                                38.81       54.44       46.28       30.94          3.63         14.19
group II—dysentery                             16.82       31.52       26.54      [illegible]    38.70         36.26
group III—diseases of the digestive system¹    49.16       30.08       40.15      104.72         48.94         74.10
group IV—other infective and parasitic
  diseases²                                     7.76       12.89       10.21       24.43         23.03        [illegible]
group V—measles, mumps, small pox,
  chicken pox                                  31.05       20.06       25.86       34.90         11.52         22.07
group VI—respiratory diseases³                129.00      153.28      140.87      160.57        195.77        179.72
group VII—eye, ear, boil and abscess,
  cellulitis and dental diseases               19.41       31.52       25.18       62.83         14.40         36.26
group VIII—other diseases⁴                     11.64       14.33       12.93       27.92         43.19         36.26
total                                         303.65      348.12      328.02      481.69        374.27        422.51

¹ Diarrhoea, enteritis, etc.
² Typhoid, cholera, diseases due to helminths etc.
³ Common cold, influenza, pneumonia, bronchitis, tonsillitis etc., including fever.
⁴ Anaemia, v.d., vascular lesions affecting central nervous system, rheumatic fever, congenital
malformation, accident etc.
ING TO DISEASE GROUPS FOR TWO SUB-SAMPLES

                                              prevalence rate per 1000 population
                                              rural                               urban
disease group                                 sub-sample  sub-sample  combined    sub-sample  sub-sample  combined
                                              1           2                       1           2
(1)                                           (2)         (3)         (4)         (5)         (6)         (7)
1. group I—tuberculosis (pulm.)                0.96        2.47        1.68        2.78        4.59        3.77
2. group II—diseases of the circulatory
   and nervous systems¹                        3.19        4.24        3.69        3.71        1.53        2.52
3. group III—diseases of the eye, ear,
   skin, bones and joints                      3.83        4.24        4.02        4.64        5.36        5.03
4. group IV—diseases of the stomach and
   duodenum except cancer                      2.55        2.83        2.68        5.57        6.12        5.87
5. group V—asthma                              3.19        3.89        3.52        5.57        3.83        4.61
6. group VI—diseases of the genital organs     3.51        2.12        2.85        4.64        4.59        4.61
7. group VII—other diseases²                   2.55        1.41        2.01        8.34        8.42        8.39
8. total                                      19.78       21.20       20.45       35.25       34.44       34.80

¹ Arteriosclerotic and degenerative heart diseases, hypertension, rheumatic fever, diseases of veins,
psychoneurosis, diseases of nerves etc.
² V.D., cancer, diabetes, avitaminosis, congenital and functional diseases, etc.
APPENDIX 2

INDIAN STATISTICAL INSTITUTE

A PILOT HEALTH SURVEY IN WEST BENGAL: MARCH-MAY 1955: HOUSEHOLD SCHEDULE 1.1

Instructions to Investigators


03.1. (Block 2): Item 3. Household means of livelihood—As in industry-occupation code list (six
digit code)

03.11. Item 4. The monthly expenditure per capita is to be worked out by first ascertaining the
monthly household expenditure on consumer goods and dividing it by the number of members of
the household.

03.12. Item 5. Religion—Hindu-0; Muslim-1; Sikh-2; Christian-3; Tribal-4; Others-5.

03.13. Item 6. Mother tongue—Bengali-0; Hindi-1; Urdu-2; Nepali-3; Tribal-4; Punjabi-5; Others-6.

03.14. Item 7. Purdah— ... women in the household do not observe purdah, enter code-1, ...

03.15. Item 8. Informant's relation to head—head-0; spouse-1; son-2; daughter-3; father-4;
mother-5; brother-6; sister-7; other relation-8; non-relation (household member)-9.

03.16. Item 9. Informant's ability—poor-0; average-1; good-2.

03.17. Item 10. Informant's willingness—hostile-0; unwilling-1; indifferent-2; helpful-3.

03.2. (Block 3): For each of the four visits enter date and signature.

03.3. (Block 4): Enter quantities consumed during the last 30 days in seers for items like rice, wheat,
other cereals, ghee, oil, sugar and gur, milk, meat and fish (if the quantity is less than a seer
enter-1). For consumption of eggs enter number and for fruits and vegetables enter values
in rupees and annas. If consumption of the latter two items is from home production impute
values at current local prices.

03.4. (Block 5): Item 1. Type of house—code-1—pucca house with brick walls, code-2—all other
types of houses.

03.41. Item 2. Number of rooms—includes all living rooms and excludes those used for bath, cooking
and store.

03.42. Item 3. Floor space—space under living rooms as well as those covered by verandahs if they
are used for the same purpose as the living rooms. Give the figure in square feet.

03.43. Item 4. Ventilation—If the smoke has no good outlet and there is no possibility of free
circulation of air enter code-1; otherwise enter code-2.

03.44. Item 5. Water (drinking, washing)—The codes are same for drinking as well as washing water.
code 1—tap water, code 2—tubewell water, code 3—well water, code 4—other types of water.

03.45. Item 6. Latrine code—
code 1—sanitary privy; code 2—service privy; code 3—pit privy; code 4—others.

03.46. Item 7. General sanitation—
code 1—surroundings clean, drainage good and open space,
code 2—surroundings clean, either drainage is not good or lacks open space,
code 3—surroundings unclean, covered with garbage and flies.

03.5. (Block 6): Who are the members of the family?

(a) All persons who have lived in the household and eaten from the household kitchen for at least
16 days during the month preceding the date of survey.

(b) All children born within 14 days prior to the date of visit to members of the category (a).

(c) All persons dead during the period, 14 days prior to date of visit, who if alive, would have
been classed as (a).

03.51. Column 2—Relationship—(three-digit code)—head-0; spouse-1; son-2; daughter-3; father-4;
mother-5; brother-6; sister-7; other relation-8; non-relation-9.

03.52. Column 3—Sex—male-1; female-2.

03.53. Column 4—Age—age last birthday.

03.54. Column 5—Marital status—never married-1; spouse living but divorced or separated-2;
married-3; spouse dead-4.

03.55. Column 6—Nature of stay—For all persons present in this household throughout the whole
year preceding the date of visit enter code-1. If the person has not stayed for the whole year
ask—(i) whether present on the date of survey, (ii) whether stayed in the household for most of
last fortnight and (iii) whether stayed in the household for most of last year and enter codes
as follows :

          present on date    stayed for most of    stayed for most of
code      of visit           last fortnight        last year
2         yes                yes                   yes
3         yes                yes                   no
4         yes                no                    yes
5         yes                no                    no
6         no                 yes                   yes
7         no                 yes                   no
8         no                 no                    yes
9         no                 no                    no

For children below 1 year of age, enter code 1, if ever since birth, they were in this household.
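
The coding rule for nature of stay (column 6) follows a regular pattern: code 1 for a full year's presence, and otherwise codes 2 to 9 running through the eight yes/no combinations in order. A minimal sketch of that rule follows; the function and parameter names are hypothetical, and the arithmetic shortcut is simply one way of reproducing the table in 03.55.

```python
def nature_of_stay_code(
    whole_year: bool,
    present_on_visit: bool = True,
    most_of_fortnight: bool = True,
    most_of_year: bool = True,
) -> int:
    """Return the column 6 code (1 to 9) for one household member."""
    if whole_year:
        # Present throughout the year preceding the visit
        # (or, for an infant, ever since birth).
        return 1
    # Each "no" answer adds a fixed weight, so the eight combinations map to 2..9
    # in the same order as the table in 03.55.
    return (
        2
        + 4 * int(not present_on_visit)
        + 2 * int(not most_of_fortnight)
        + 1 * int(not most_of_year)
    )


# A person present on the date of visit, absent for most of the last fortnight,
# but resident for most of the last year.
print(nature_of_stay_code(False, True, False, True))
```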

03.56. Column 7—Educational status—(two digit code)—
Left-hand digit—education, general :
illiterate-1; literate but below primary-2; primary-3; middle-4; matric-5; intermediate-6;
graduate in science-7; graduate in arts-8; post-graduate in science-9; post-graduate in arts-0.
Right-hand digit—education, technical :
no technical or professional qualification-1; technical or professional skill only, without degree or
equivalent diploma but with or without certificate or diploma of lower order-2; holder of
equivalent degree or diploma in teaching-3; engineering-4; agriculture-5; medicine, allopathic-6;
other medicine-7; veterinary-8; law and commerce-9; other technology or profession-0.

03.57. Columns 14 and 15—Principal means of livelihood—As in industry-occupation code list (six
digit code)

03.58. Column 18—Weight code—code 1—constantly losing weight, code 2—not constantly losing
weight.

03.59. Column 19—Temperature code—code 1—constantly feeling feverish ... or rise of temperature.

03.5.10. Column 20—Health code—ask whether the person is constantly feeling fatigue and
constantly lacking appetite and enter code as follows :

            constant fatigue    constant lack of appetite
code 1      yes                 yes
code 2      yes                 no
code 3      no                  yes
code 4      no                  no

03.6. Blocks 1, 2 and 4 to 6 need be filled up only in the first visit. However, in subsequent visits some
alterations might have to be made in block 6 for additions and exits of household members.
Such entries may be made with an asterisk and footnotes given.

03.7. (Block 7): This has to be filled up only for such members of the household who were sick during
the reference period. If a person fell sick more than once, for each sickness of the same person
a separate line of entries is to be made. Cases of child birth, though medically treated, should
not be considered as cases of sickness.

03.71. Column 1—visit no.—for each sickness during the reference period of the 1st visit, enter 1
and for each sickness during the 2nd reference period enter 2 and so on.

03.72. Column 2—If any sickness was prevailing in the individual in this as well as in the previous
reference period enter code 2. If it is a new case not prevailing during the previous visit enter
code 1.

03.73. Column 3—Enter serial number of the affected person as entered in block 6. If he falls ill
more than once during the reference period repeat this number for every sickness of his.

03.74. Column 4—Put down name of the disease as stated by the informant. If further particulars
of the disease are given by the informant of his own accord, they may be entered in column 15
meant for ‘remarks’.

03.75. Column 5—Some diseases are chronic i.e. they last for a long period of time and have
no abrupt time of onset. Examples are T.B., heart diseases, diabetes, asthma, etc. Some diseases
are acute and are of shorter duration with an abrupt time of onset and recovery if the result
is not death. Examples are malaria, typhoid, cholera, dysentery, etc. It is also possible for
certain diseases like malaria and dysentery to manifest as either acute or chronic. Another
aspect of a disease is the degree of disability it causes on the stricken individual. If the
affected person is not disabled from performing the usual assignment of work the disease is
non-disabling. If it prevents him from doing his usual work, it is called disabling; the latter in
extreme case may necessitate confinement to bed or hospital. In the case of old men, women
and children not going to school it may be difficult to distinguish disability from non-disability
unless the former is of a degree needing confinement to bed or hospital. For these people
disability simply means taking of medicine or special diet. Based on these two aspects of the
disease 6 codes for the nature of disease are given below :

Acute :
non-disabling—code 1; disabling but not confined to bed or hospital—code 2; confined
to bed or hospital—code 3.

Chronic :
non-disabling—code 4; disabling but not confined to bed or hospital—code 5; confined to
bed or hospital—code 6.

03.76. Column 6—Date of onset is the date on which disability starts—for all non-disabling diseases
(acute or chronic) insert a dash in this column. For non-working persons date of onset is the
date on which medical treatment or special diet starts.

03.77. Column 7—Date of recovery is the date on which disability ceases. If the person recovered
on 15th March enter R—15th March and if he died on 15th March, enter D—15th March, the letter
R or D preceding the date indicating the result of the disease. If the illness prevailed on the date
of visit enter code ‘P’.

03.78. Column 8—Period of sickness to be entered in terms of months, days.

03.79. Column 9—Code for type of attendance. If the case was attended by an allopath enter
code 1; if attended by homeopath or ayurved or unani or quack enter codes 2, 3, 4 or 5
respectively; if attended by none enter code 6.

03.7.10. Columns 10 to 12—All fees paid to physicians such as those for consultations, visits,
operations, etc., are to be lumped up and entered in col. 10 and all expenses on medicines to be
entered in col. 11 and all other expenses such as those for tonics, hospital rent, fees paid to
nurses and dhais etc., to be entered in col. 12. If the amount expended by a certain household
is to cover more than one case of sickness it is necessary to allocate the respective proportion
to the individual cases of sickness.


03.7.11. Column 13—Why medical care was not availed? 


codes 1—no hospital or private physician available, 2—too expensive; 3—no faith in treatment; 
4—sickness not serious; 5—other reasons. 


03.7.12. Column 14—Reference period: for first visit the reference period is always 14 days. But
for subsequent visits the actual number of days reckoned from the date of last visit to this one
must be entered in the column.

03.8. (Block 8): The entries in this block pertain to only those women who have been delivered during
one of the reference periods. As the survivalship of children born during any reference period has
to be observed till the termination of the entire period of survey it is necessary to enter this
item of information again in visits subsequent to the one in which the live birth was noted.

03.81. Columns 1 to 3: as in block 7. 


03.82. Column 4—There are three types of termination. 


The codes are as follows : codo 1—live birth, 
code 2—still birth and code 3—abortion. 


03.83. Column 5—Date of delivery should be entered irrespective of the nature of termination. 

03.84. Columns 6 to 8—If entry in col. 4 is code 1, then enter sex code of the child in col. 6, survival 
code (surviving-1, dead-2, left the household and not likel 
left the household and is likely to return before th 
(months, days) at present or at death or 
required in subsequent visits also. 


y to return before the end of survey-3. 
he end of survey-4) in column 7, and age in 
at departure in col. 8. For columns 7 and 8 entries nre 


03.85. Column 9— Place of delivery, code 1— delivered in this household, code 2— delivered in another 
household within district, code 3— delivered in hospital within 


district, code 4—dolivered in 
another household outside district, code 5— delivered in hospital outside district. 


type: code 1—doctor, 2—midwife or nurse (qualified), 3—dhai, 
» 5—none, includes attendance by relatives and friends. 


03.86. Column 10—Attendance 
4— hospital. 


03.87. Column 11—Period of confinement—enter either period of hospitalisation or, if home delivery,
enter period of bed-days. If on date of visit the woman is still lying on bed enter code ‘bed’ and
in the next visit enquire again about the total bed-days and enter this item alongside of entries
in columns 7 and 8.

03.88. Column 12—Period of convalescence—this means the period of disability following bed-days.
If the woman is still convalescing enter code ‘conv’. The method of entry is same as in
col. 11.


03.89. Columns 13 to 16—Cost of medical care is split into four parts. Column 13 gives expenses incurred
towards physicians’ fees for consultation, visit, operation etc. Column 14 gives fees paid to midwife
or dhai. Column 15 gives expenditure on medicine and column 16 gives cost of hospitalisation.

03.8.10. Column 17—Reference period—enter as in block 7.


03.8.11. Column 18—Survival of mother—code 1—mother alive, code 2—mother dead.

03.8.12. In block 8 if any woman has given birth to twins two consecutive lines must be entered.
03.9. (Block 9): Every woman married, widowed, 


block. Her serial number 


ed in column 1, 
03.91. Column 2—Age at present should be copied from 


03.92. Column 3—А, 


т) 


| 
| 


03.94. Columns 


col. 2 
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11 io 33—These give the age of the moth 
Вне е ther and the inati 
бек ie € Sie E If 8 d termination occurred kc uem Mas 
ei denar ces sd ^4 Les of termination should not be given in the col рж е wi 
о 33 must be entered. Column 31 gives the order of rw ни м 
йт" E. aes in which: the termination took place and pC 
e of visit, code 2—still birth, code 3—abor i 3d 
put the child is dead then enter ‘D-N’ sore м’ m d 
pleted months of life (say, if the child died after 3 months of lif s oe ye 
iormination took place at least one year ago, then entries in col b Ес. 
Such a termination should be entered as the previous due а кү us die 
of termination and against result enter codes she og Ld" des 


tion, column 32 gives the calend 
code I—live birth, 
tho result was а live birth 1 


made. 
of mothor at the time 
code 1—live birth, and child survived first year of life,
code 2—live birth, child died within first month of life,
code 3—live birth, child died after one month but before one year of its life,
code 4—still birth, code 5—abortion.
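The five result codes listed above can be read as a small decision rule. The following is only a minimal sketch: the function name, the outcome labels and the treatment of a death at exactly one month (taken here as code 3) are assumptions made for illustration, not part of the schedule.

```python
from typing import Optional


def termination_result_code(outcome: str,
                            age_at_death_months: Optional[float] = None) -> int:
    """Return the result code (1-5) for a termination.

    outcome: 'live', 'still' or 'abortion' (hypothetical labels).
    age_at_death_months: age at death in months for a live-born child,
        or None if the child survived its first year of life.
    """
    if outcome == "still":
        return 4
    if outcome == "abortion":
        return 5
    if outcome != "live":
        raise ValueError(f"unknown outcome: {outcome!r}")
    if age_at_death_months is None or age_at_death_months >= 12:
        return 1  # live birth, child survived first year of life
    if age_at_death_months < 1:
        return 2  # live birth, child died within first month of life
    return 3      # live birth, child died after one month but before one year
```

For instance, a child who died at 6 months of life would receive code 3 under this reading.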
Example 1: (1) Woman's present age-35, (2) age at marriage-18, (3) total terminations-2,
(4) 1st termination occurred at age 20, child died at 6 months of life, (5) 2nd termination
occurred at age 22 and resulted in still birth.

Since the woman's age at present is 35 and her last termination occurred at 22, i.e., 13 years
ago, no entries are needed in columns 31-33.

The entries are:
col.2 col. 3 col. 10 col. 11 
35 2 20 


col, 12 col. 13 col. 14 col. 31 col. 32 col. 33 


22 = 


Example 2: (1) woman's age at present 28, (2) age at marriage 20, (3) total terminations 4, (4) 1st
termination at age 22, live birth, child survived first year of life, (5) 2nd termination at age
24, child died before 1 month of life, (6) 3rd termination at age 26, still birth, (7) 4th
termination at age 28, child died after one month but before 1 year of life. Since the last
termination occurred at age 28 and the woman's present age being 28 this termination
must have occurred within last year. Ask for the calendar month of this termination and
the age at death in completed months. Suppose the answer is ‘May, 1954’ and age at
death is 6 months. Then the entries are:

1. 11 col. 12 col, 13 col. 14 col. 15 col. 16 col. 17 col. 18 = 31 col 82 E 
^us um NM SEM ool; OL hou 


col. 3 col. 10 co 
4 


28 20 
Column 34—Information on ante-natal care is to be collected only in respect of terminations
occurring during the last 1 year. Two-digit code—left hand digit indicating the type of attendance—
no attendance-0; hospital-1; welfare centre-2; qualified practitioner-3; qualified midwife-4.
Right-hand digit—number of such attendance. If there is more than one type of attendance, only the
predominant type need be recorded.
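The two-digit code for column 34 can be sketched as below. This is an illustration only: the function and dictionary names are hypothetical, and the refusal of counts above 9 is an assumption, since the instructions do not say how larger numbers of attendances were to be recorded in a single digit.

```python
# Left-hand digit: type of ante-natal attendance, as listed in the instructions.
ATTENDANCE_TYPE = {
    "no attendance": 0,
    "hospital": 1,
    "welfare centre": 2,
    "qualified practitioner": 3,
    "qualified midwife": 4,
}


def antenatal_code(attendance_type: str, n_attendances: int) -> str:
    """Build the two-digit column 34 code for the predominant attendance type."""
    left = ATTENDANCE_TYPE[attendance_type]
    if not 0 <= n_attendances <= 9:
        # Assumption: a single right-hand digit cannot hold more than 9.
        raise ValueError("number of attendances must fit in one digit")
    return f"{left}{n_attendances}"
```

Under these assumptions, three visits to a hospital would be entered as `antenatal_code("hospital", 3)`, giving "13".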




APPENDIX 2 
INDIAN STATISTICAL INSTITUTE 


WEST BENGAL HEALTH AND EMPLOYMENT STUDY¹ MARCH—MAY 1955
Household Schedule 


1. district
2. thana


3. village 


4. sample thana no. 


5. sample village no.


6. sample В.В. *по. 


(2) investigation particulars (4) diet д 5) housing condition 


at [date] eee 


2. number of rooms 
oit | 3. floor space 


^ 


. ventilation 


5. water drinking washing 


relationship 


1 Employment portion not appended here. 
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ON RECALL LAPSE IN INFANT DEATH REPORTING? 


By RANJAN KUMAR SOM 
and 
NITAI CHANDRA DAS 


Indian Statistical Institute, Calcutta 


SUMMARY. Recall lapse as a distorting factor in infant death (and sex ratio at birth) reporting
in historical studies, based on the interview method at a current moment, was introduced in the Indian
demographic situation by Mahalanobis and Das Gupta (1954) and elaborated by Das Gupta, Som, Majumdar,
and Mitra (1955) in Couple Fertility; the approach in these two studies, based on the National Sample Survey
data, was through the marriage cohorts where time entered as a distinct component. The trend of the
observed proportions over time was seen to get substantially distorted from that of the true proportions,
obtained from external evidences, establishing thus the existence of a definite, progressively decreasing
recall lapse in infant death reporting. Poti, Raman, Biswas, and Chakravarti (1959), analyzing the data on
infant death of the West Bengal Household Comparative Study and Health and Employment Study 1955, by
order of birth concluded that recall lapse in infant death reporting was not statistically significant.
The present paper seeks to establish the theoretical validity of the study of the recall lapse in infant
death reporting by the marriage cohort differential vis-a-vis that by the order of birth differential and,
analyzing the same data as that utilized by Poti etc. by marriage cohorts, confirms the findings of the
previous two studies.


1. Recall lapse as a constituent, distorting factor in infant death (and sex ratio at
birth) reporting in historical studies based on the interview method at a current moment
was first introduced by Mahalanobis and Das Gupta (1954) for the Indian demographic situation,
obtained from the National Sample Survey (NSS). In Couple Fertility by Das Gupta,
Som, Majumdar, and Mitra (1955), based also on the same NSS data, this problem was
studied in greater details and tentative exponential curves fitted, with good agreement, for
the percentages under-reporting of infant deaths by marriage cohorts.

2. Let $m_t$ mothers married at calendar year $t$ have $b_{ti}$ $i$-th order of births, of which
$d_{ti}$ die in their first year of life: then $p_{ti}$, the infant death proportion (IDP)
of the $i$-th order of birth to such mothers is

$$p_{ti} = d_{ti}/b_{ti}.$$

In a survey, where the past fertility history is collected through interview at a current moment,
defective number of births and infant deaths $b'_{ti}$ and $d'_{ti}$ respectively are reported, giving
the corresponding IDP

$$p'_{ti} = d'_{ti}/b'_{ti}.$$

The problem is to find out if recall lapse operates in infant death reporting so that the trend
of the observed $p'$ over time might get substantially distorted from that of
the true proportions ($p$).


3. In the study by Mahalanobis and Das Gupta and also in Couple Fertility, this
topic was studied by the marriage cohort differential. The observed IDP for the $m_t$ mothers
over all birth orders is

$$\sum_i d'_{ti} \Big/ \sum_i b'_{ti} = \sum_i b'_{ti}\,p'_{ti} \Big/ \sum_i b'_{ti} = p'_{t\cdot}, \text{ say}.$$

A comparison of $p'_{t\cdot}$ with $p'_{\tau\cdot}$ ($\tau$ being more recent than $t$) was then made to show that
significant recall lapse existed in infant death reporting, since the observed proportions gave the
relation $p'_{\tau\cdot} > p'_{t\cdot}$ while from external evidence it could be stated that $p_{\tau\cdot} < p_{t\cdot}$. (In the
above two studies, the IDP was analyzed for marriage cohort groups and not for any
particular marriage cohorts; we can, however, assume without any loss in generality that $t$
and $\tau$ are the central points of two marriage cohort groups).

¹ A rejoinder to “A pilot health survey in West Bengal”, by Poti, Raman, Biswas and Chakravarti
(1959), Sankhya, 21, 141-204.


4. The problem of recall lapse in infant death reporting has recently been referred
to by Poti, Raman, Biswas, and Chakravarti (1959) in “A pilot health survey in West Bengal,
1955”, based on the data of the West Bengal Household Comparative Study and Health
and Employment Study, 1955¹, published in this issue on p. 141-204. Here the study of the
recall lapse in infant death reporting was limited to the different birth orders from mothers
having five or more terminations. The observed IDP for the $i$-th birth order over all
mothers (and marriage cohorts) is

$$\sum_t d'_{ti} \Big/ \sum_t b'_{ti} = \sum_t b'_{ti}\,p'_{ti} \Big/ \sum_t b'_{ti} = p'_{\cdot i}, \text{ say}.$$

Thus, a comparison between the IDPs $p'_{\cdot i}$ and $p'_{\cdot r}$ for two birth orders $i$ and $r$ would not
reflect the effect of recall lapse over time (apart from the small interval between successive
births which is of the order of only 2-3 years). The observed IDPs by orders of birth for
the West Bengal Health Survey cannot then, as the authors made them out on the basis
of their Table 4.1, contradict the existence of a substantial recall lapse over time between
marriage cohorts, observed in the previous two studies on a national scale.
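The argument of this paragraph can be illustrated with a toy calculation. Every figure below is hypothetical and chosen only to make the point visible: two marriage cohorts whose true IDP falls over time, recall of infant deaths that is poorer for the earlier cohort, births assumed to be fully reported, and equal numbers of births in every cohort by birth order cell.

```python
# Hypothetical figures, chosen only to illustrate the argument of paragraph 4.
cohorts = {
    "earlier (t)": {"true_idp": 0.25, "recall": 0.60},   # poor recall of deaths
    "recent (tau)": {"true_idp": 0.20, "recall": 0.95},  # good recall of deaths
}
birth_orders = [1, 2, 3]
births_per_cell = 100  # reported births of each order in each cohort (assumed)


def observed_idp_by_cohort():
    """Observed IDP for each marriage cohort, pooled over birth orders."""
    out = {}
    for name, c in cohorts.items():
        births = births_per_cell * len(birth_orders)
        deaths = births * c["true_idp"] * c["recall"]
        out[name] = deaths / births
    return out


def observed_idp_by_order():
    """Observed IDP for each birth order, pooled over marriage cohorts."""
    out = {}
    for order in birth_orders:
        births = births_per_cell * len(cohorts)
        deaths = sum(births_per_cell * c["true_idp"] * c["recall"]
                     for c in cohorts.values())
        out[order] = deaths / births
    return out


by_cohort = observed_idp_by_cohort()
by_order = observed_idp_by_order()
print(by_cohort)  # observed IDP rises with recency although the true IDP falls
print(by_order)   # identical for every birth order: no trace of the lapse
```

In this sketch the comparison by birth order is flat by construction, so it cannot reveal a lapse that operates between marriage cohorts.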


5. From the same data as that of Table 4.1 of “A pilot health survey in West 
Bengal", the IDPs for the different marriage cohort groups calculated by us are presented 
in Table 1. This table, which has a counterpart in Table 8 of Mahalanobis and Das 
Gupta (1954) and Table 8.1 of Couple Fertility (Das Gupta, Som, Majumdar, and Mitra, 
1955), on the other hand shows clearly the existence of similar recall lapse over time in 
infant death reporting except for small kinks (occasioned perhaps by the small sample sizes), 


and confirms the findings of Mahalanobis and Das Gupta (1954) as also of Das Gupta, Som,
Majumdar, and Mitra (1955).


TABLE 1. INFANT DEATH PROPORTION (PER 1000 LIVE BIRTHS) FOR EVER-MARRIED 
WOMEN HAVING FIVE OR MORE TERMINATIONS BY MARRIAGE COHORT 
GROUPS: WEST BENGAL HOUSEHOLD COMPARATIVE STUDY AND 

HEALTH & EMPLOYMENT STUDY, 1955 


marriage cohort             rural    urban
(1)                          (2)      (3)
1. before 1910               147       99
2. 1910-19                   155      158
3. 1920-29                   200      180
4. 1930-39                   183      115
5. 1940-45                   188      221
6. all marriage cohorts      169      140
   (no. of mothers)         (522)    (165)


¹ In this survey the approach was through the living married women which would, on the basis of
the findings of Couple Fertility, miss about 12 per cent of the broken couples. The effect
of the differing approach adopted on the recall lapse in infant death reporting is, however, not being
discussed here.
6. In “A pilot health survey in West Bengal”, the recall lapse in infant death
reporting was also studied for mothers aged 43 years or over, the births to these mothers
being divided into two groups—(i) births occurring within 15 years preceding the date of
survey; and (ii) births occurring before 15 years preceding the date of survey. The IDP
was seen to be greater for birth group (ii) than that for (i), from which also the authors
concluded that recall lapse was not statistically significant.

7. The same data on which the above finding was based (Table 4.2 of “A pilot
health survey in West Bengal”) were analyzed in this note by marriage cohorts. For births
occurring within 15 years preceding the date of survey, the sample sizes were inadequate
for the individual marriage cohort groups. For births occurring 15 years or earlier preceding
the date of survey, the IDPs are presented in Table 2 by marriage cohort groups. From
this table, it will be seen that the analysis by marriage cohort shows up the existence of
recall lapse for these mothers also. The division of the births for mothers aged 43 years
or over into two groups by birth period and analysis over all marriage cohorts within each
birth period group is not expected to show up the effect of recall lapse in infant death
reporting as for the first group, i.e., for births occurring within 15 years preceding the date
of survey, about 9 per cent in rural areas and 12 per cent in the urban relate to the first
four births: the corresponding proportion for births occurring 15 years or earlier preceding
the date of survey is 63 per cent in rural areas and 71 per cent in the urban. Thus the
second group, being loaded heavily in favour of the earlier orders of births, just presents
again the finding that the observed IDPs for the earlier orders of birth in this particular
study were higher than those for the later orders; this was, in fact, a general feature which
ran through all marriage cohorts for all mothers.


TABLE 2. INFANT DEATH PROPORTION (PER 1000 LIVE BIRTHS) FOR BIRTHS
OCCURRING 15 YEARS OR EARLIER PRECEDING THE DATE OF SURVEY TO EVER-MARRIED
WOMEN AGED 43 YEARS OR OVER, BY MARRIAGE COHORT GROUPS:
WEST BENGAL HOUSEHOLD COMPARATIVE STUDY
& HEALTH AND EMPLOYMENT STUDY, 1955

marriage cohort             rural    urban
1. before 1910               140      106
2. 1910-19                   172      175
3. 1920-29¹                  180      145
4. all marriage cohorts      160      137
   (no. of mothers)         (359)    (153)

¹ Including the marriage period 1930-39 with a very small sample size.
8. The fallacy which crept in the arguments by the authors of “A pilot health survey
of West Bengal” was inherent in the assumption that recall lapse in infant death reporting
was order of birth-specific, and not mother- (or marriage cohort-) specific. In the previous
two studies, the same group of mothers did not appear more than once in the differentials
studied whereas in this particular study they do and vitiate the homogeneity of data. For
either substantiation or invalidation of the phenomenon of recall lapse observed in the former
studies, the same line of analysis should have been followed here also: even then, a regional
survey cannot always show up the same features as may be observed in a study on a
national scale, which is expected to smooth out regional peculiarities if they exist.
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THE ANALYSIS OF HETEROGENEITY. I. 


By J. B.S. HALDANE 
Indian Statistical Institute, Calcutta 


SUMMARY. Estimators of the mean and variance of a frequency are given when this frequency 


varies through a series of samples. 


INTRODUCTION 

The following situation frequently arises in biological research, and doubtless
in other branches of science. A number of experiments or observations are made under
conditions as nearly similar as possible. Each of them leads to the production of a
sample, whose members may be classed into two types which we may call successes
and failures, though they may be females and males, fertile and sterile matings, survivals
and deaths, and so on. If we have $n$ samples, and the $i$-th consists of $s_i$ members
of which $a_i$ are successes and $b_i$ are failures, we can draw up a $(2 \times n)$ fold table and
apply the $\chi^2$ or some other test of homogeneity. If the test is judged compatible with
homogeneity, we can adopt the simple hypothesis that the probability of success
was the same in each sample, and estimate it as $p = \sum_{i=1}^{n} a_i \big/ \sum_{i=1}^{n} s_i$. If however the
test is judged significant of heterogeneity, we must conclude that $p$ has varied from
one experiment to another, its value in the $i$-th experiment being $p_i$. We can then
proceed to estimate the mean of $p_i$. Although the estimate given above is unbiassed,
it will be shown that it is not efficient unless the sample number $s_i$ is constant. We
can also in general estimate the variance and higher moments of $p$. While $\chi^2$ is a
test for heterogeneity it is not a measure of it, but we shall see that the estimate of
the variance of $p$ is related to $\chi^2$. In this paper I shall only deal with the estimation
of the mean and variance.

In my experience this problem has arisen in two rather different contexts.
On the one hand we may have to analyse a series of litters of mice or other small
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animals produced from parents as similar as possible, under standardised conditions. 
The mean value of s is of the order of 6, and for most values of s there is a fair number
of samples. If heterogeneity has been detected, we can estimate the mean value of
p for various sample numbers, and see whether they regress significantly on s. If
they do not, we can weight each with the appropriate amount of information, and
combine them. If our data are numerous enough, we can do the same for other
moments.


On the other hand we may have to deal with a series of insect or plant families, 


in which the mean value of s is about 100, and two samples with the same value of 
s are unusual. 


As Robertson (1951) pointed out, this problem is the inverse of the problem
studied by Lexis (1877). Lexis considered the effect on the variance of $a_i$ of a known
variance of $p_i$.


In what follows I use $\kappa_r$ to mean the $r$-th cumulant of the true distribution
of $p$. $k_r$ means an unbiassed estimate of $\kappa_r$, and $\kappa(r^s)$ the expectation of the $s$-th
cumulant of the distribution of $k_r$, while $k(r^s)$ is an unbiassed estimate of $\kappa(r^s)$. We
have to consider expectations at two levels. I denote expectations for a given
value of $p_i$, and thus within a single sample, with an asterisk. Thus $\mathcal{E}^*(a_i) = p_i s_i$,
$\mathcal{E}^*(a_i^2) = p_i^2 s_i(s_i-1) + p_i s_i$, $\mathcal{E}^*(a_i b_i) = p_i(1-p_i)\,s_i(s_i-1)$ and so on. I denote expectations
within the whole group of $n$ samples without an asterisk. Thus $\mathcal{E}(p) = \mathcal{E}(p_i) = \kappa_1$.
I use $\Sigma a$ or $\Sigma a_i$ to mean $\sum_{i=1}^{n} a_i$ and so on. If $s$ is constant $\mathcal{E}(\Sigma a) = \kappa_1 ns$. If $s$ is
variable I assume that $s_i$ and $p_i$ are uncorrelated, though this should be verified
where possible. In any case I assume that $p_i$ and $p_j$ are uncorrelated, that is to say
$\mathcal{E}(p_i p_j) = \kappa_1^2$, if $i \ne j$. Also

$$\mathcal{E}(p^2) = \kappa_1^2 + \kappa_2.$$


SAMPLES OF CONSTANT SIZE 


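If every sample consists of $s$ members, then since $\mathcal{E}^*(a_i) = p_i s$, so
$\mathcal{E}(\Sigma a) = \kappa_1 ns$, whence

$$k_1 = (ns)^{-1}\,\Sigma a. \tag{1}$$

This estimate is clearly unbiassed and efficient.

$$(\Sigma a)^2 = \Sigma a^2 + \sum_i \sum_{j \ne i} a_i a_j.$$

So

$$\begin{aligned}
\mathcal{E}[(\Sigma a)^2] &= ns(s-1)\,\mathcal{E}(p^2) + ns\,\mathcal{E}(p) + n(n-1)s^2\,\mathcal{E}(p_i p_j) \\
&= ns(s-1)(\kappa_1^2 + \kappa_2) + ns\,\kappa_1 + n(n-1)s^2\kappa_1^2 \\
&= ns(ns-1)\kappa_1^2 + ns\,\kappa_1 + ns(s-1)\kappa_2.
\end{aligned}$$

But $[\mathcal{E}(\Sigma a)]^2 = n^2 s^2 \kappa_1^2$, so

$$\operatorname{var}(\Sigma a) = ns(\kappa_1 - \kappa_1^2) + ns(s-1)\kappa_2,$$

whence

$$\kappa(1^2) = \operatorname{var}(k_1) = \frac{\kappa_1(1-\kappa_1)}{ns} + \frac{(s-1)\kappa_2}{ns}. \tag{2}$$

The first term is the component due to the finite size of the samples. The second,
$n^{-1}(1-s^{-1})\kappa_2$, is due to the variance of $p$. The second term may exceed the
first.

$$\Sigma a\,\Sigma b = \sum_i \sum_{j \ne i} a_i b_j + \Sigma ab.$$

So

$$\begin{aligned}
\mathcal{E}[\Sigma a\,\Sigma b] &= ns(s-1)(\kappa_1 - \kappa_1^2 - \kappa_2) + n(n-1)s^2(\kappa_1 - \kappa_1^2) \\
&= ns(ns-1)\kappa_1(1-\kappa_1) - ns(s-1)\kappa_2.
\end{aligned}$$

Also $\mathcal{E}[\Sigma ab] = ns(s-1)(\kappa_1 - \kappa_1^2 - \kappa_2)$.
Hence $\mathcal{E}[(s-1)\Sigma a\,\Sigma b - (ns-1)\Sigma ab] = n(n-1)s^2(s-1)\kappa_2$, that is,

$$k_2 = \frac{(s-1)\,\Sigma a\,\Sigma b - (ns-1)\,\Sigma ab}{n(n-1)s^2(s-1)} \tag{3}$$

is an unbiassed estimate of $\kappa_2$. Also

$$\mathcal{E}\left[\frac{\Sigma a\,\Sigma b - \Sigma ab}{n(n-1)s^2}\right] = \kappa_1 - \kappa_1^2.$$

So we can put (2) in terms of observed quantities.

$$k(1^2) = \frac{\Sigma a\,\Sigma b - n\,\Sigma ab}{n^2(n-1)s^2} = -\frac{\operatorname{Cov}(a, b)}{(n-1)s^2}, \tag{4}$$

where $\operatorname{Cov}(a, b) = n^{-1}\Sigma ab - n^{-2}\,\Sigma a\,\Sigma b$.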

Robertson (1951) gave an expression for the variance of $p$ which in my
symbolism becomes:

$$k_2 = [n^2 s^2(s-1)]^{-1}\,[(s-1)\,\Sigma a\,\Sigma b - ns\,\Sigma ab].$$

Unless $n$ is small, this is very near to my expression (3), the difference being the value
(4) of $k(1^2)$. However when $n$ is small the difference is not negligible. For example
if $s = 100$, $n = 5$, and the values of $a$ are 4, 8, 10, 14, 17, then $\Sigma a = 53$, $\Sigma b = 447$,
$\Sigma ab = 4635$. $k_1 = 0.106$, and expression (3) gives .0016436 for the variance or
.04054 for the standard deviation of $p$, while Robertson's expression gives .00112763
and .03358. If my own value is judged to be more accurate, it should be used.

$\chi^2_{n-1}$, used as a test of homogeneity, may be written

$$\chi^2_{n-1} = \frac{ns(\Sigma a\,\Sigma b - n\,\Sigma ab)}{\Sigma a\,\Sigma b}.$$

When $\kappa_2 = 0$, that is to say $p$ is constant, its exact expectation is

$$\mathcal{E}(\chi^2_{n-1}) = (ns-1)^{-1}\,n(n-1)s.$$

So

$$k_2 = \frac{\Sigma a\,\Sigma b\,[(ns-1)\chi^2_{n-1} - n(n-1)s]}{n^3(n-1)s^3(s-1)}. \tag{5}$$
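The constant-size formulae can be checked against the small worked example in the text ($s = 100$, $n = 5$, $a$ = 4, 8, 10, 14, 17). The sketch below is only an illustration: the function name and the dictionary keys are hypothetical, and the expressions are coded as (3), (4), the $\chi^2$ identity and Robertson's form are read in this section.

```python
def constant_size_estimates(a, s):
    """Estimates for n samples of constant size s with success counts a."""
    n = len(a)
    b = [s - x for x in a]                       # failures in each sample
    Sa, Sb = sum(a), sum(b)
    Sab = sum(x * y for x, y in zip(a, b))
    k1 = Sa / (n * s)                            # estimate of the mean of p
    # Expression (3): unbiassed estimate of the variance of p.
    k2 = ((s - 1) * Sa * Sb - (n * s - 1) * Sab) / (n * (n - 1) * s**2 * (s - 1))
    # Robertson's (1951) expression in the same symbols.
    k2_robertson = ((s - 1) * Sa * Sb - n * s * Sab) / (n**2 * s**2 * (s - 1))
    # Expression (4): estimated sampling variance of k1.
    var_k1 = (Sa * Sb - n * Sab) / (n**2 * (n - 1) * s**2)
    # Chi-squared for homogeneity with n - 1 degrees of freedom.
    chi2 = n * s * (Sa * Sb - n * Sab) / (Sa * Sb)
    return {"k1": k1, "k2": k2, "k2_robertson": k2_robertson,
            "var_k1": var_k1, "chi2": chi2, "Sa": Sa, "Sb": Sb, "Sab": Sab}


est = constant_size_estimates([4, 8, 10, 14, 17], 100)
print(est["k1"])            # 0.106
print(est["k2"])            # about 0.0016436
print(est["k2_robertson"])  # about 0.0011276
print(est["k2"] ** 0.5, est["k2_robertson"] ** 0.5)  # about 0.04054 and 0.03358
```

The difference between the two variance estimates equals `est["var_k1"]`, which is the relation stated in the text between expression (3), Robertson's expression and (4).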
Since $\chi^2_{n-1}$ can be zero with a finite probability, but cannot be negative, it
follows that $k_2$ can be negative, its minimum value, if $p$ is constant, being
$-p(1-p)(s-1)^{-1}$. The null sampling distribution of $k_2$ is given by that of $\chi^2$,
and the significance of a positive value is that of the corresponding value of $\chi^2$. If
$k_2$ is negative we must suppose that $\kappa_2$ is zero or too small to estimate, while
drawing the appropriate conclusions from the small value of $\chi^2_{n-1}$.

We now see that $\chi^2$, as a test of homogeneity, has a triple function. Firstly
its excess over its null value furnishes a test of whether the variance of $p$ exceeds zero.
Secondly it allows us, by means of (5), to estimate the variance of $p$. And thirdly
it measures the uncertainty of the estimate of the mean of $p$. For (4) may be written
as

$$\operatorname{var}(k_1) = [n^3(n-1)s^3]^{-1}\,\chi^2_{n-1}\,\Sigma a\,\Sigma b. \tag{6}$$

Workers are rightly suspicious of a mean based on a heterogeneous set of
samples. (4) or (6) tells them just how suspicious they should be. I may add that
if $\chi^2$ is calculated by the method of Haldane (1955) which, it is claimed, saves
a good deal of computation in some cases, (5) and (6) are more useful than (3) and (4).
(5) is analogous to the well-known relation between $r$ and $\chi^2$ for a $(2 \times 2)$-fold
table. I hope to give estimators of $\kappa_3$, $\kappa_4$, and of the sampling variance of $k_2$,
in a later paper. The latter is not however of immediate importance, since we have
an expression for the significance of a given value of $k_2$.


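SAMPLES OF VARIABLE SIZE

If we have a large number of samples for each of a few small values of $s$, as
with human families, we can treat each set for a given value of $s$ separately, and
combine the estimates of $p$, the amount of information for each value of $s$ being given
by (2). When however, the values of $s$ are all or mostly different, we proceed as
follows.

If $w_i$ be any weighting factor, then

$$\mathcal{E}(\Sigma w_i a_i) = \kappa_1\,\Sigma w_i s_i.$$

So provided $\Sigma w_i s_i = 1$, $\Sigma w_i a_i$ is an unbiassed estimate of $\kappa_1$. Clearly $w$ must be a
one-valued function of $s$. Clearly also when $\kappa_2 = 0$, that is to say $p$ is constant, $w_i$
should be constant, and therefore equal to $(\Sigma s)^{-1}$. When however $\kappa_2$ is not zero,
the weight $ws$ given to each sample should be an increasing function of $s$. One can derive the expression
$w_i \propto [s_i - 1 + \kappa_2^{-1}\kappa_1(1-\kappa_1)]^{-1}$, which follows directly
from (2) by a somewhat intuitive argument. But it can be derived more rigorously
as follows.

The most efficient form of $w_i$ is that which minimizes the variance of
$k_1 = \Sigma w_i a_i$. Now

$$k_1^2 = \sum_i w_i^2 a_i^2 + \sum_i \sum_{j \ne i} w_i w_j a_i a_j.$$

Since $\mathcal{E}[a_i^2] = (\kappa_1^2 + \kappa_2)s_i(s_i-1) + \kappa_1 s_i$
and $\mathcal{E}(a_i a_j) = \kappa_1^2 s_i s_j$,
it follows that

$$\begin{aligned}
\mathcal{E}[k_1^2] &= \sum_i w_i^2\,[(\kappa_1^2 + \kappa_2)s_i(s_i-1) + \kappa_1 s_i] + \kappa_1^2 \sum_i \sum_{j \ne i} w_i w_j s_i s_j \\
&= \kappa_1^2(\Sigma w_i s_i)^2 + \kappa_1(1-\kappa_1)\,\Sigma w_i^2 s_i + \kappa_2\,\Sigma w_i^2 s_i(s_i-1).
\end{aligned}$$

But $\Sigma w_i s_i = 1$, so on subtracting $\kappa_1^2$ we find

$$\operatorname{var}(k_1) = \kappa_1(1-\kappa_1)\,\Sigma w_i^2 s_i + \kappa_2\,\Sigma w_i^2 s_i(s_i-1).$$

Since $\Sigma w_i s_i = 1$, this is minimal when

$$w_i s_i\,[\kappa_2 s_i + (\kappa_1 - \kappa_1^2 - \kappa_2)] \propto s_i,$$

or

$$w_i = \frac{[\kappa_2 s_i + \kappa_1 - \kappa_1^2 - \kappa_2]^{-1}}{\sum_j s_j\,[\kappa_2 s_j + \kappa_1 - \kappa_1^2 - \kappa_2]^{-1}}.$$

So if $c = \kappa_1(1-\kappa_1)\kappa_2^{-1} - 1$, $w_i \propto (s_i + c)^{-1}$,

$$k_1 = \frac{\sum \dfrac{a}{s+c}}{\sum \dfrac{s}{s+c}}, \tag{7}$$

and

$$\operatorname{var}(k_1) = \kappa(1^2) = \frac{\kappa_2}{\sum \dfrac{s}{s+c}}. \tag{8}$$

If $\kappa_2$ is small, that is to say $p$ is nearly constant, $c$ is very large, and $k_1$ approximates
to $(\Sigma s)^{-1}\Sigma a$, as is obvious. If $p$ can only assume the values of zero and unity,
$c = 0$, and $k_1 = n^{-1}\Sigma a s^{-1}$. Otherwise the values of $c$ are intermediate. Thus if
all values of $p$ from 0 to 1 are equally frequent, $c = 2$, and if all values from 0 to ¼
are equally frequent, $c = 20$, and so on. Usually therefore it will be necessary to
estimate $\kappa_2$.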
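The weighting rule can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, the counts `demo_a` and `demo_s` are invented for the demonstration, and the formulae coded are the weights $(s_i + c)^{-1}$ with $c = \kappa_1(1-\kappa_1)/\kappa_2 - 1$ and the estimate and variance that follow from them.

```python
def c_from_cumulants(kappa1, kappa2):
    """c = kappa1 (1 - kappa1) / kappa2 - 1, as defined in the text."""
    return kappa1 * (1 - kappa1) / kappa2 - 1


def weighted_mean_estimate(a, s, c, kappa2=None):
    """Weighted estimate of the mean of p with weights proportional to 1/(s + c).

    Returns (k1, var_k1); var_k1 is None unless a value of kappa2 is supplied.
    """
    weights = [1.0 / (si + c) for si in s]
    norm = sum(w * si for w, si in zip(weights, s))      # sum of s / (s + c)
    k1 = sum(w * ai for w, ai in zip(weights, a)) / norm
    var_k1 = None if kappa2 is None else kappa2 / norm
    return k1, var_k1


# Hypothetical counts, used only to show the two limiting cases.
demo_a = [3, 10, 40]
demo_s = [10, 50, 100]
k1_equal, _ = weighted_mean_estimate(demo_a, demo_s, c=0.0)    # mean of a/s
k1_pooled, _ = weighted_mean_estimate(demo_a, demo_s, c=1e12)  # close to sum(a)/sum(s)
print(k1_equal, k1_pooled)
```

With $c = 0$ every sample proportion receives equal weight, and with a very large $c$ the estimate approaches the pooled proportion, which are the two limiting cases described in the text.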
Prof. C. R. Rao has pointed out to me that the estimate (7) is not quite
unbiassed, since in general estimates of $\kappa_1$ and $\kappa_2$ are correlated. However the bias
will seldom be large. The most efficient estimator of $\kappa_2$ can only be given when
higher moments are known, so that formally the problem is very complicated. But
an infinite number of unbiassed estimators of $\kappa_2$ can be derived, according to the
weights given to different samples. The weight to be attached to any sample will
always increase with $s$, but the weight as a function of sample size will be somewhere
between that appropriate when $s$ is large and $\kappa_2$ small, and that appropriate when
$s$ is small and $\kappa_2$ large. We can write down any number of expectations, including
the following:

$$\begin{aligned}
\mathcal{E}[\Sigma s^{-1}ab] &= (\kappa_1 - \kappa_1^2 - \kappa_2)(\Sigma s - n), \\
\mathcal{E}[\Sigma (s-1)^{-1}ab] &= (\kappa_1 - \kappa_1^2 - \kappa_2)\,\Sigma s, \\
\mathcal{E}[\Sigma s^{-1}(s-1)^{-1}ab] &= (\kappa_1 - \kappa_1^2 - \kappa_2)\,n, \\
\mathcal{E}[\Sigma a\,\Sigma b] &= (\kappa_1 - \kappa_1^2)(\Sigma s - 1)\,\Sigma s - \kappa_2(\Sigma s^2 - \Sigma s).
\end{aligned}$$
From these we can at once derive a number of unbiassed estimates of $\kappa_2$,
of which the most likely to be useful are:

$$k_{2a} = \frac{(\Sigma s - n)\,\Sigma a\,\Sigma b - (\Sigma s - 1)\,\Sigma s\,\Sigma(s^{-1}ab)}{(\Sigma s - n)[(\Sigma s)^2 - \Sigma s^2]}
= \frac{\Sigma a\,\Sigma b\,[(\Sigma s - 1)\chi^2_{n-1} - (n-1)\Sigma s]}{\Sigma s\,(\Sigma s - n)[(\Sigma s)^2 - \Sigma s^2]}, \tag{9}$$

$$k_{2b} = \frac{\Sigma a\,\Sigma b - (\Sigma s - 1)\,\Sigma (s-1)^{-1}ab}{(\Sigma s)^2 - \Sigma s^2}, \tag{10}$$

$$k_{2c} = \frac{n\,\Sigma s^{-1}a\;\Sigma s^{-1}b - (n^2 - \Sigma s^{-1})\,\Sigma s^{-1}(s-1)^{-1}ab}{n^2(n-1)}. \tag{11}$$

Of these estimates $k_{2a}$ and $k_{2b}$ should differ very little, and $k_{2b}$ is the easiest
to compute unless $\chi^2_{n-1}$ has already been computed, which however will usually be
the case. It will be seen that $k_{2a}$ and $k_{2b}$ have about the same weighting as $\chi^2$, while
$k_{2c}$ assigns approximately equal weight to each sample.

From the example which follows it will be seen that these estimates may be
very close to one another. Indeed $k_{2a}$ and $k_{2b}$ agree to four significant figures. They
thus furnish a fairly precise estimate of $\kappa_2$, which, in turn, allows an accurate estimate
of $\kappa_1$, and of the sampling variance of this estimate of $\kappa_1$.


TABLE 1. RECOMBINANTS IN 13 CULTURES OF
DROSOPHILA SUBOBSCURA

   s       a       b     p' = s^{-1}a
  224      69     155     .30804
  206      59     147     .28641
  255      70     185     .27451
  267      70     197     .26217
  247      61     186     .24696
  238      57     181     .23950
  166      36     130     .21687
  199      42     157     .21106
  210      39     171     .18571
  284      50     234     .17606
  190      33     157     .17368
  187      32     155     .17112
  243      40     203     .16461
 2916     658    2258
A NUMERICAL EXAMPLE 


In Table 1 the successive values of $s$ are the numbers of imagines of Drosophila
subobscura in 13 bottles each derived from a single pair mating. In each bottle
the father was homozygous for a pair of autosomal recessive genes belonging to the
same linkage group and therefore located on the same chromosome. The mother
was heterozygous at these two loci. The values of $a$ are the numbers of flies in which
these loci had undergone recombination, commonly described as “cross-overs.” I
have to thank Mrs Trent, of the Department of Biometry, University College, London,
for these figures. $b = s - a$, and $p'_i = a_i s_i^{-1}$. That is to say the value of $p'_i$ is the
estimate of recombination frequency from the $i$-th culture. The cultures are arranged
in descending order of $p'_i$. $\chi^2_{12} = 37.059$, $P(\chi^2) = .00022$, so there is very strong
evidence of heterogeneity.

Now if $p$ were constant, its estimate would be 658/2916 = .2257 ± .0073.
In fact all estimates known to me have been made in this way.

The mean value of $p'_i$ is .2244, its median .2169. $\operatorname{var}(p'_i) = .002380$, so
$\sigma_p = .0487$. This is much too high as an estimate of the variance of $p$. The formulae
(9) and (11) give $k_{2a} = k_{2b} = .001636$, $k_{2c} = .001682$. This is a very satisfactory
agreement and we may estimate $\kappa_2$ as .00166, giving $\sigma_p = .0407$ which is considerably
below the crude estimate of .0487.

Adopting the provisional value of .225 for $\kappa_1$, we find $c = 105.03$. Putting
$c = 105$ in (7), $k_1 = .2273$. If we repeat the process we find $c = 106.3$, which does
not alter the value of $k_1$. From (8) we find $\kappa(1^2) = .0001908$. So
$k_1 = .2273 \pm .0138$.

Thus the estimate of the mean of $p$ is only changed from its “classical” value
by 12% of its standard sampling error. The change could be much greater if the
values of $s$ had a larger coefficient of variation. On the other hand its standard
sampling error is nearly doubled. And we have at least an estimate of the variance
of $p$.
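The three estimators of the variance of $p$ for samples of variable size can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the counts are those of Table 1 as transcribed here, and the expressions are the unbiassed combinations that follow from the expectations $\mathcal{E}[a_i b_i] = (\kappa_1 - \kappa_1^2 - \kappa_2)s_i(s_i - 1)$ and $\mathcal{E}[a_i b_j] = (\kappa_1 - \kappa_1^2)s_i s_j$ used in the paper. Figures computed from the transcribed counts may differ somewhat from the printed ones.

```python
# Counts of Table 1 as transcribed: sample sizes s and recombinants a.
s = [224, 206, 255, 267, 247, 238, 166, 199, 210, 284, 190, 187, 243]
a = [69, 59, 70, 70, 61, 57, 36, 42, 39, 50, 33, 32, 40]


def kappa2_estimates(a, s):
    """Return (k2a, k2b, k2c), three unbiassed estimates of the variance of p.

    Requires every sample size to exceed 1.
    """
    n = len(s)
    b = [si - ai for ai, si in zip(a, s)]
    Sa, Sb, Ss = sum(a), sum(b), sum(s)
    Ss2 = sum(si**2 for si in s)
    ab = [ai * bi for ai, bi in zip(a, b)]
    T_s = sum(x / si for x, si in zip(ab, s))                 # sum of ab / s
    T_s1 = sum(x / (si - 1) for x, si in zip(ab, s))          # sum of ab / (s - 1)
    T_ss1 = sum(x / (si * (si - 1)) for x, si in zip(ab, s))  # sum of ab / (s (s - 1))
    Sp = sum(ai / si for ai, si in zip(a, s))                 # sum of a / s
    Sq = sum(bi / si for bi, si in zip(b, s))                 # sum of b / s
    Sinv = sum(1 / si for si in s)                            # sum of 1 / s
    k2a = ((Ss - n) * Sa * Sb - (Ss - 1) * Ss * T_s) / ((Ss - n) * (Ss**2 - Ss2))
    k2b = (Sa * Sb - (Ss - 1) * T_s1) / (Ss**2 - Ss2)
    k2c = (n * Sp * Sq - (n**2 - Sinv) * T_ss1) / (n**2 * (n - 1))
    return k2a, k2b, k2c


k2a, k2b, k2c = kappa2_estimates(a, s)
print(k2a, k2b, k2c)
```

When every sample has the same size the three expressions coincide with the constant-size estimate, which gives a convenient check of an implementation.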


DISCUSSION 


This paper is a preliminary attempt to develop a field of statistics opened up 
by Robertson (1951). Ifthe sample size s is constant it is merely a matter of algebraical 
accuracy to obtain unbiassed and efficient estimates of all the moments or cumulants 
of the distribution of p, upto and including the s-th. On the other hand, at least 
with the approach here adopted, when s is not constant, one requires statistics of order 
2r to obtain efficient estimates of the r-th moment or cumulant. Formally this 
involves an infinite regress. It may be that the problem will be soluble in finite 
terms by Robertson’s or related methods.



Vol. 21] SANKHYĀ : THE INDIAN JOURNAL OF STATISTICS [Parts 3 & 4


The example shows that second order statistics may suffice for practical
purposes when all values of $s$ are of the order of 100 and not very variable. On
the other hand had the same number of individuals occurred in some hundred samples
in which $s$ ranged from 1 to about 10, as in human families, fourth order statistics
would have been desirable to obtain the correct weightings in evaluating $k_2$.

The numerical example shows that most of the published data on linkage are
probably inaccurate. The mean recombination values found may only require
slight revision. Their sampling errors are consistently larger than those published.
The variances of recombination values will require estimation. A sufficiently variable
value leads to spurious “interference” of the frequencies of crossing over in adjacent
segments. In fact the whole theory of linkage will require revision when sufficient
data are available. An attempt is being made to collect such data in this Institute.
I believe that the а 
design of sample surveys, 
it is desirable, when /:, j 


gh estimate of Ko. 
Further the optimal design for the 
imal for the estimatio 


Some of the ex 


pressions here found ean 
theory of the analysis bs 


of variance, For example (1 ived b idi 

Y consider- 
being Constant in each 
the paper by discussing 
can be applied to the 


such limiting cases, and the mor 


e direct 
evaluation of higher moments, ESSE E 


I have to thank Mrs Trent (Mi 
lished data at my disposal, ru ema une „жш + 
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SUFFICIENT STATISTICS OF MINIMAL DIMENSION! 


By EDWARD W. BARANKIN 2 
and 
MELVIN KATZ, Jr.3 


University of California, Berkeley 


SUMMARY. For families of probability distributions characterized by certain differentiability 
properties — a type of family customarily met in practice — the general problem of finding the smallest
number of continuously differentiable, real-valued functions which can constitute a sufficient statistic is 
attacked. The precise local and global phases of this problem are formulated, and definitive results are 
obtained. 


1. INTRODUCTION


Let $\mathcal{M} = \{\mu_\theta,\ \theta \in \Theta\}$ be a family of probability measures on the Lebesgue-
measurable subsets of an open set $\Omega$ in $E_n$, a Euclidean $n$-space. A point of $\Omega$ will be
denoted by $x = (x_1, x_2, \ldots, x_n)$. The parameter set $\Theta$ will be an open subset of a
Euclidean $\nu$-space, $E_\nu$; thus, $\theta = (\theta_1, \theta_2, \ldots, \theta_\nu)$. We suppose that each $\mu_\theta$ is absolutely
continuous with respect to Lebesgue measure, and we consider that there is a determination,
$p$, of the family density function such that

$$p(x, \theta) > 0, \qquad (x, \theta) \in \Omega \times \Theta, \tag{1.1}$$

and such that the following derivatives exist and are, together with $p$, continuous
throughout $\Omega \times \Theta$:


36, 
(1.2) 
0p з = 1,2, sn; j=1,2, m 
00,01; 
Op Q—12,.,5; GS Bows 
Р 02,00; 
¹ … this paper was supported (in part) by funds provided under Contract … 29 with the School of Aviation Medicine, USAF, Randolph Air Force Base, Texas.

² During 1956-57, Fellow of the John Simon Guggenheim Memorial Foundation and Visiting Professor at the Institut Henri Poincaré, Paris, where a small segment of the work leading to this article was done. This author expresses his gratitude also to the Centre Universitaire International in Paris for the facilities they so kindly provided.

³ Now at the University of Chicago.
The continuity of the second derivatives implies that

$$\frac{\partial^2 p}{\partial \theta_j\,\partial x_i} = \frac{\partial^2 p}{\partial x_i\,\partial \theta_j}. \tag{1.3}$$

The assumption of openness of $\Omega$ and $\Theta$ is not absolutely essential for the validity of our results, but we adopt it in order to avoid additional detail. In the kinds of cases that usually arise in practice, and in which $\Omega$ and $\Theta$ are not both open, it is an easy matter to show that our results, when applied to the interiors of $\Omega$ and $\Theta$ (which are open sets), give the desired results for $\Omega \times \Theta$ itself.


The assumption (1.1) is to the effect that $\Omega$ is the common carrier of all the densities of the family. Concerning the important class of families in which the carriers vary with $\theta$, we are hopeful that by suitable transformations of the problem it may be possible to apply bodily the theorems in this article, rather than re-prove them, in order to obtain the corresponding results.


The motivating question to which we address ourselves in this article is formulated and investigated in terms of the following two definitions.


Definition 1.1: A function $T$ on $\Omega$ will be said to be Euclidean of dimension $r$ at $x^0$ if there is a neighbourhood of $x^0$ such that $T$ maps this neighbourhood into a Euclidean $r$-space.


Definition 1.2: Let $T$ be Euclidean of dimension $r$ at $x^0 \in \Omega$; specifically, let

$$T(x) = (h_1(x), h_2(x), \ldots, h_r(x)) \tag{1.4}$$

in some neighbourhood of $x^0$. Then we say that $T$ is regular at $x^0$ if the functions $h_i$, $i = 1, 2, \ldots, r$, are continuously differentiable and if the Jacobian matrix

$$J = \left\| \frac{\partial h_i}{\partial x_j} \right\|, \qquad i = 1, 2, \ldots, r;\ j = 1, 2, \ldots, n, \tag{1.5}$$

is of rank $r$ at $x^0$.


Notice that in this last definition, by virtue of the continuous differentiability of the $h_i$, the matrix $J$ is of rank $r$ everywhere in a neighbourhood of $x^0$. Notice also that this rank condition implies $r \le n$.
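
To make Definitions 1.1 and 1.2 concrete, the following illustration may help. It is an assumed example, not taken from the article: take $n \ge 2$ and the statistic $T(x) = (\sum_i x_i, \sum_i x_i^2)$, which is Euclidean of dimension 2 at every point of $E_x^n$.

```latex
J(x) =
\begin{pmatrix}
1 & 1 & \cdots & 1 \\
2x_1 & 2x_2 & \cdots & 2x_n
\end{pmatrix},
\qquad
\operatorname{rank} J(x) =
\begin{cases}
2, & \text{if } x_i \neq x_j \text{ for some } i, j, \\
1, & \text{if } x_1 = x_2 = \cdots = x_n .
\end{cases}
```

Under this assumption $T$ is regular at every point whose coordinates are not all equal, and it fails to be regular on the diagonal $x_1 = \cdots = x_n$, where the rank of $J$ drops below the dimension 2.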

We may now pose this question: for a given point $x^0 \in \Omega$, what is the smallest integer $r$ such that there exists a sufficient statistic for $\mathcal{P}$ which is Euclidean of dimension $r$ at $x^0$, and continuously differentiable in a neighbourhood of $x^0$? Under the limited conditions we have imposed on the family density function $p$, there is no reason to expect that this smallest integer $r$ will be the same for every point $x^0$. And indeed, our second example in Section 5 shows that this local minimal dimension of a sufficient statistic may vary over $\Omega$. The answer to the above question, for points $x^0$ of $\Omega$ that we call regular points, is given by Theorems 3.2 and 3.3. More exactly, the sequence of results is as follows. Definitions 3.1 and 3.2 present a certain explicit, integer-valued function $\rho$ on $\Omega$. Then Theorem 3.1 shows that for any point $x^0 \in \Omega$, $\rho(x^0)$ is a lower bound for the dimension at $x^0$ of a sufficient statistic which is Euclidean at $x^0$ and regular at $x^0$. By means of an additional argument, we give a direct strengthening of this result in proving, in Theorem 3.2, that $\rho(x^0)$ is a lower bound for the dimension at $x^0$ of a sufficient statistic which is Euclidean at $x^0$ and continuously differentiable in a neighbourhood of $x^0$ (but not necessarily such that the matrix $J(x^0)$ is of the pertinent rank $r$). Following this, Definition 3.3 characterizes a regular point of $\Omega$, and thereupon we show (after proving two lemmas), with Theorem 3.3, that at a regular point $x^0$ the lower bound $\rho(x^0)$ is achieved. This theorem is proved constructively, so that it gives an explicit sufficient statistic which is dimensionally minimal at $x^0$. These results comprise Section 3. Section 2 is devoted to establishing a series of lemmas toward the proof of Theorem 3.1.

We go on in Section 4 to obtain global results out of the local results of Section 3. Evidently, the extent of globality that we may hereby obtain is necessarily limited to the set $R$ of regular points of $\Omega$. This set is an open set, and its complement in $\Omega$, the set of nonregular points, is nowhere dense in $\Omega$ (see Lemma 3.1). In general, the Lebesgue measure of the set of nonregular points need not be 0. However, to encounter a case in which this nowhere dense set is of positive measure would be the unusual thing. In all cases of practical importance this set is of measure 0. Thus, in the stochastic context, our results in Section 4, being valid almost everywhere (Lebesgue) in $R$, are, for all practical purposes, fully global for $\Omega$. Theorem 4.1 establishes (constructively) the basic global result that there is a sufficient statistic, $T^*$, which is of minimal local dimension and regular almost everywhere in $R$. $T^*$ is continuously differentiable on an open subset of $R$ having complement in $R$ of Lebesgue measure 0; and among all sufficient statistics having this desirable analytical property, such a statistic as $T^*$ is the ultimate answer to the question of minimal dimensionality, as far as $R$ is concerned, and as far as $\Omega$ is concerned if $\Omega - R$ is of measure 0.

The statistic $T^*$ has, in general, different dimensions at different points of its continuously differentiable domain. One may prefer to deal rather with an almost everywhere continuously differentiable sufficient statistic of constant dimension over its continuously differentiable domain, and therefore desire to know the minimum possible value for this global dimension. This is in the spirit of most usage in the literature. Theorem 4.2 contributes to the answer to this question. Of course, the minimal global dimension achievable within $R$ is $\max_{x \in R} \rho(x)$.

Finally in Section 4 we raise the obviously pertinent question of how far dimensional minimality of a sufficient statistic carries us toward functional minimality. By a (globally) functionally minimal sufficient statistic we mean one which is a function (almost everywhere relative to each of the measures in $\mathcal{P}$) of any other sufficient statistic.
This is the minimality notion discussed by Lehmann and Scheffé (1950). Their result, Theorem 6.3, p. 336, shows that for the family $\mathcal{P}$ that we are concerned with here there does exist a functionally minimal sufficient statistic. We are able to show here, in Theorem 4.3, that the sufficient statistic $T^*$ of Theorem 4.1, or, equally well, the statistic $T^+$ of Theorem 4.2, is locally functionally minimal almost everywhere in $R$. That is, for almost all points of $R$, there is a neighbourhood of the point such that, in this neighbourhood, $T^*$, or $T^+$, is a function of any sufficient statistic. Whether or not this can be improved to global functional minimality almost everywhere in $R$, while preserving the desired property of continuous differentiability, is left an open question.


Section 5 is the last section of the paper. In it we present two examples to illustrate our results.


For our investigation we shall need a general criterion for a sufficient statistic for our family $\mathcal{P}$. This is provided, in most convenient form, by a result of Bahadur (1954, Corollary 6.1, p. 438); this result has evolved from Fisher (1922) and the theorem of Neyman (1935) through the work of Halmos and Savage (1949, Corollary 1, p. 234) and Lehmann and Scheffé (1950, Theorem 6.2, p. 332). For easy reference, we shall state Bahadur's criterion here as a lemma, in terms specific to our particular case. The statistic $T$ appearing in the statement has no prior conditions on it; it is any particular function on $\Omega$, and it is rendered measurable by taking, as the pertinent $\sigma$-algebra of subsets of its range, the class of all sets whose inverse images, under $T$, are Lebesgue sets. If $\mathcal{R}_T$ denotes the range of $T$, the criterion is as follows.


Lemma 1.1 (N.-H.-Sa.-L.-Sc.-B.): A necessary and sufficient condition that the statistic $T$ be a sufficient statistic for $\mathcal{P}$ is that there exist a nonnegative function $f$ on $\mathcal{R}_T \times \Theta$, and a nonnegative function $g$ on $\Omega$, such that (i) for each $\theta \in \Theta$, $f(T(\cdot), \theta)$ is Lebesgue measurable, (ii) $g$ is Lebesgue measurable, and (iii) for each $\theta \in \Theta$ the equality

$$p(x, \theta) = f(T(x), \theta)\, g(x) \tag{1.6}$$

holds for almost all (Lebesgue) $x \in \Omega$.
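
As an illustration of the criterion, consider an assumed example that is not taken from the article: $x_1, \ldots, x_n$ independent $N(\theta, 1)$ observations, so that $\Omega = E_x^n$ and $\Theta$ is the real line. A factorization of the form (1.6) can then be written down directly, with $T(x) = \sum_i x_i$:

```latex
p(x,\theta)
  = (2\pi)^{-n/2}\exp\Bigl(-\tfrac12\sum_{i=1}^{n}(x_i-\theta)^2\Bigr)
  = \underbrace{\exp\Bigl(\theta\,T(x)-\tfrac{n}{2}\theta^{2}\Bigr)}_{f(T(x),\,\theta)}
    \;\underbrace{(2\pi)^{-n/2}\exp\Bigl(-\tfrac12\sum_{i=1}^{n}x_i^{2}\Bigr)}_{g(x)},
\qquad
T(x)=\sum_{i=1}^{n}x_i .
```

Both factors are nonnegative and measurable, and in this example the equality holds for every $x$, not merely almost everywhere, so the one-dimensional statistic $T$ is sufficient by Lemma 1.1.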


We shall make abundant use of the implicit function theorem. For an account 
of this the reader is referred to, for example, Caratheodory (1935, p. 9). The other 
analytical tools we employ, such as transformation of integrals, the Fubini theorem, 
etc., need no special references. 


Results of the kind achieved in this article are important for investigations wherein one wants to have the advantage of differentiable statistics. Prominent among authors whose work is concerned with such sufficient statistics are Bhattacharyya and Rao (see References). In addition, the question arises as to the possibility of extending the now classical results of Fisher (1934) and Darmois and Koopman (see References) with the aid of results of the kind we obtain here. We shall not consider these applications in this article.
The article of Dynkin (1951), which has not appeared in the references of any of the English language articles on sufficient statistics, and of which the authors became aware only when the present paper was in its final stages of preparation, is of interest on two accounts relative to our work here. First, Dynkin pursues his investigation of sufficient statistics with attention to their local functional character; secondly, his Theorem 2 is an instance of a result determining, in a special case (product distributions), a sufficient statistic which is, locally, dimensionally and functionally minimal. In regard to the first point, the authors thus find another, and apparently the only other, investigator who brings into evidence the essential role of local considerations in regard to minimal dimensionality. In the present article, the general local-versus-global situation is laid open rather completely.

We remark that the problem of minimal dimensionality of sufficient statistics was originally proposed to one of the authors by J. Neyman.


2. THREE LEMMAS

The goal of this section is the establishment of the result stated as Lemma 2.3 below. This is the important preliminary result that if the sufficient statistic $T$ is Euclidean at $x^0$ and regular at $x^0$ then the factorization (1.6) can be made locally with fully differentiable functions $f$ and $g$. Lemmas 2.1 and 2.2 are ancillary to the proof of Lemma 2.3.

Lemma 2.1: Let the function $T$ on $\Omega$ be Euclidean of dimension $r$ at $x^0$ and regular at $x^0$, being given by (1.4) in some neighbourhood, $N$, of $x^0$. Let $x'$ and $x''$ be two points of $N$ such that $J(x')$ and $J(x'')$ (see (1.5)) are both of rank $r$, and such that $T(x') = T(x'')$. Let $N'$ and $N''$ be disjoint neighbourhoods of $x'$ and $x''$, respectively, and let $A' \subset N'$ and $A'' \subset N''$ be Lebesgue measurable sets such that the sets $N' - A'$ and $N'' - A''$ are of Lebesgue measure 0.

Then, there exist points $a' \in A'$ and $a'' \in A''$, different from $x'$ and $x''$, respectively, such that $T(a') = T(a'')$.

The proof of this lemma involves more complication for $r < n$ than for $r = n$. We shall carry through the arguments for the case $r < n$, and indicate at the proper place the simplification for $r = n$.

Proof: Consider the point $x'$. We may suppose, without loss of generality, that

$$\left.\frac{\partial(h_1, h_2, \ldots, h_r)}{\partial(x_1, x_2, \ldots, x_r)}\right|_{x = x'} \neq 0. \tag{2.1}$$

Let $E_y^r$ denote the $r$-dimensional Euclidean range-space of $T$ on $N$, and let $y$ denote a point of $E_y^r$. Let $E_x^n$ (the containing space of $\Omega$) be represented as $E_x^r \times E_x^{n-r}$, so that for any point $x = (x_1, x_2, \ldots, x_n) \in E_x^n$, we have $(x_{r+1}, x_{r+2}, \ldots, x_n) \in E_x^{n-r}$. A point of $E_x^{n-r}$ will be denoted by $z$; and $z'$ will denote the specific point $(x'_{r+1}, x'_{r+2}, \ldots, x'_n)$, that is, the projection of $x'$ into $E_x^{n-r}$. Finally, let $U$ denote the identity on $E_x^{n-r}$.
It follows from (2.1), by the Implicit Function Theorem, that there is a spherical neighbourhood $S'$ in $E_y^r$, centered at $T(x')$, and a spherical neighbourhood $W'$ in $E_x^{n-r}$, centered at $z'$, and a continuously differentiable (vector) function $\varphi = (\varphi_1, \varphi_2, \ldots, \varphi_r)$ on $S' \times W'$ into $E_x^r$, such that the continuously differentiable function $(\varphi, U)$ on $S' \times W'$ is onto a neighbourhood, $D'$, of $x'$, and is the inverse of the continuously differentiable function $(T, U)$ on $D'$ onto $S' \times W'$. We take $S'$ and $W'$ so small that (i) $D' \subset N'$, (ii) $D'$ is bounded, and (iii)

$$\frac{\partial(h_1, h_2, \ldots, h_r)}{\partial(x_1, x_2, \ldots, x_r)} \neq 0, \qquad x \in D'. \tag{2.2}$$

We have then

$$\frac{\partial(\varphi_1, \ldots, \varphi_r, x_{r+1}, \ldots, x_n)}{\partial(y_1, \ldots, y_r, x_{r+1}, \ldots, x_n)} = \frac{\partial(\varphi_1, \ldots, \varphi_r)}{\partial(y_1, \ldots, y_r)} = \left[\frac{\partial(h_1, \ldots, h_r)}{\partial(x_1, \ldots, x_r)}\right]^{-1}_{x = (\varphi(y, z),\, z)} \neq 0, \qquad (y, z) \in S' \times W'. \tag{2.3}$$

The set $N' - A'$ is a subset of $N'$ of Lebesgue measure 0. It is a Lebesgue-measurable set, but not necessarily a Borel set. We may, however, choose a Borel set in $N'$ of Lebesgue measure 0 which includes $N' - A'$; let $B_1$ be such a set. Then $B' = N' - B_1$ is likewise a Borel set and is a subset of $A'$. Thus, we have replaced $A'$ by a Borel set $B' \subset A'$ which also has the property that $N' - B'$ is of Lebesgue measure 0.

Let $C' = B' \cap D'$. Since $D'$ is an open set, $C'$ is a Borel set, and $D' - C'$ is of Lebesgue measure 0. Hence, if $\lambda_n$ denotes $n$-dimensional Lebesgue measure, we have

$$\lambda_n(C') = \lambda_n(D'). \tag{2.4}$$

This common measure is finite since $D'$ is bounded.

$D' \subset E_x^n$ being the image of $S' \times W' \subset E_y^r \times E_x^{n-r}$ under $(\varphi, U)$, we have

$$\lambda_n(D') = \int_{S' \times W'} \left|\frac{\partial(\varphi_1, \ldots, \varphi_r, x_{r+1}, \ldots, x_n)}{\partial(y_1, \ldots, y_r, x_{r+1}, \ldots, x_n)}\right| d\lambda_n. \tag{2.5}$$

For the sake of brevity, let $M(y, z)$ denote the integrand values of (2.5), that is, the common absolute value of the three members of (2.3). Thus, (2.5) becomes

$$\lambda_n(D') = \int_{S' \times W'} M \, d\lambda_n. \tag{2.6}$$

Let $K'$ denote the image of $C'$ under $(T, U)$. Then $K'$ is the inverse image of $C'$ under $(\varphi, U)$, and since $(\varphi, U)$ is continuous and $C'$ is a Borel set, it follows that $K'$ is a Borel set. Consequently, we have

$$\lambda_n(C') = \int_{K'} M \, d\lambda_n. \tag{2.7}$$
If we define the function $k$ on $S' \times W'$ by

$$k(y, z) = \begin{cases} 1 & \text{if } (y, z) \in K', \\ 0 & \text{otherwise}, \end{cases} \tag{2.8}$$

then (2.7) may be replaced by

$$\lambda_n(C') = \int_{S' \times W'} k M \, d\lambda_n. \tag{2.9}$$

Let $\lambda_r$ denote $r$-dimensional Lebesgue measure in the $r$-dimensional sphere $S'$, and $\lambda_{n-r}$ denote $(n-r)$-dimensional Lebesgue measure in the $(n-r)$-dimensional sphere $W'$. We apply the Fubini theorem to (2.6) and (2.9) to obtain

$$\lambda_n(D') = \int_{W'_1} \left[\int_{S'} M \, d\lambda_r\right] d\lambda_{n-r} \tag{2.10}$$

and

$$\lambda_n(C') = \int_{W'_1} \left[\int_{S'} k M \, d\lambda_r\right] d\lambda_{n-r}, \tag{2.11}$$

where $W'_1$ is a subset of $W'$, with $\lambda_{n-r}(W'_1) = \lambda_{n-r}(W')$, such that for each $z \in W'_1$, the integrands in (2.10) and (2.11) are $\lambda_r$-measurable and integrable functions (of $y$) on $S'$. Now, the integrand in (2.11) is everywhere less than or equal to the integrand in (2.10). Therefore, for each $z \in W'_1$, the inner integral in (2.11) is less than or equal to the inner integral in (2.10). Hence, by (2.4) it follows that for almost all $z \in W'_1$, the inner integral in (2.11) is equal to the inner integral in (2.10). Let $z''$ be such a point of $W'_1$; thus,

$$\int_{S'} k(y, z'')\, M(y, z'') \, d\lambda_r = \int_{S'} M(y, z'') \, d\lambda_r. \tag{2.12}$$

The function $M(\cdot, z'')$ on $S'$ is positive throughout $S'$, and therefore (2.12) implies that

$$k(y, z'') = 1 \quad \text{for almost all } y \in S'. \tag{2.13}$$

Now, for a given $y$, there is a $z$ such that $k(y, z) = 1$ if and only if there is a $z$ such that $(y, z) \in K'$, thus, if and only if there is an $x \in C'$ such that $T(x) = y$. Hence, (2.13) asserts that for almost all $y \in S'$ there is a point $x \in C'$ such that $T(x) = y$. Recalling that $C' \subset B' \subset A'$, we may restate this as: for almost all $y \in S'$ there is a point $x \in A'$ such that $T(x) = y$.

This is the result we have sought concerning the set $A'$. As we indicated we would do, we have given the explicit arguments for the case $r < n$. However, it is now evident that in the case $r = n$ the arguments proceed in the same way, except that the factor space $E_x^{n-r}$ and the sphere $W'$ in it do not enter into the considerations, and it is therefore no longer necessary to appeal to the Fubini theorem.
We now carry out the same kind of analysis for the set $A''$, and we obtain the corresponding result, namely, that for almost all $y$ in some sphere $S'' \subset E_y^r$, centered at $T(x'') = T(x')$, there is a point $x \in A''$ such that $T(x) = y$. Thus, the spheres $S'$ and $S''$ (are in the same space, $E_y^r$, and) are concentric, and if $S$ denotes the smaller of the two, then our results on $A'$ and $A''$ combine to give the result that, for almost all $y \in S$, there are points $a' \in A'$ and $a'' \in A''$ such that $T(a') = T(a'') = y$. Hence, on choosing such a $y \neq$ the center of $S$, we have the result asserted in the statement of Lemma 2.1. The proof is therefore complete.


Lemma 2.2: Under the hypothesis of Lemma 2.1, there exist sequences $\{x'^{(s)},\ s = 1, 2, \ldots\} \subset A'$ and $\{x''^{(s)},\ s = 1, 2, \ldots\} \subset A''$, tending properly (that is, no point of the sequence is identical with the limit point) to $x'$ and $x''$, respectively, such that $T(x'^{(s)}) = T(x''^{(s)})$ for all $s = 1, 2, \ldots$.


This is an immediate consequence of Lemma 2.1.


Lemma 2.3: Let $T$ be a $\mathcal{P}$-sufficient statistic which is Euclidean of dimension $r$ at $x^0$ and regular at $x^0$, being given by (1.4) in some neighbourhood of $x^0$.

Then, there is a neighbourhood $N$ of $x^0$ such that $T(N)$ is a neighbourhood of $T(x^0)$ and there are functions $f$, on $T(N) \times \Theta$, and $g$, on $N$, such that

$$p(x, \theta) = f(T(x), \theta)\, g(x) \quad \text{for all } x \in N,\ \theta \in \Theta, \tag{2.14}$$

and such that $f(y, \theta)$ has continuous partial derivatives $\partial f / \partial y_i$, $\partial f / \partial \theta_j$, $\partial^2 f / \partial \theta_j\, \partial y_i$, $\partial^2 f / \partial y_i\, \partial \theta_j$, $i = 1, 2, \ldots, r$; $j = 1, 2, \ldots, \nu$, in $T(N) \times \Theta$, and $g(x)$ has continuous partial derivatives $\partial g / \partial x_i$, $i = 1, 2, \ldots, n$, in $N$.

Proof: Let $N_0$ be a neighbourhood of $x^0$ such that $J(x)$ is of rank $r$ for all $x \in N_0$. It is a consequence of the Implicit Function Theorem that the image under $T$ of any neighbourhood of $x^0$ includes a neighbourhood of $T(x^0)$ in $E_y^r$. Let $N^{(y)}$ be a neighbourhood of $T(x^0)$ which is included in $T(N_0)$. And define

$$N = N_0 \cap T^{-1}(N^{(y)}). \tag{2.15}$$

It follows from the continuity of $T$ that $N$ is an open set, and hence a neighbourhood of $x^0$.

If $x \in N$, then $x \in T^{-1}(N^{(y)})$, and therefore $T(x) \in N^{(y)}$. Conversely, suppose $y$ is any particular point of $N^{(y)}$. Since $N^{(y)} \subset T(N_0)$, there is a point $x \in N_0$ such that $y = T(x)$. For this $x$ we have both $x \in N_0$ and $x \in T^{-1}(N^{(y)})$; therefore, $x \in N$. We have thus proved that

$$T(N) = N^{(y)}, \tag{2.16}$$

which is to say that for the neighbourhood $N$ of $x^0$, $T(N)$ is a neighbourhood of $T(x^0)$ in $E_y^r$.
Let us keep in mind also that, since $N \subset N_0$, the matrix $J(x)$ is of rank $r$ for all $x \in N$.

Since $T$ is sufficient, we have, by Lemma 1.1, that there exist nonnegative functions $f_1$, on $\mathcal{R}_T \times \Theta$, and $g_1$, on $\Omega$, such that, for each $\theta \in \Theta$,

$$p(x, \theta) = f_1(T(x), \theta)\, g_1(x) \tag{2.17}$$

for almost all (Lebesgue) $x \in \Omega$. We shall confine our attention to this equation for $x \in N$. For any $\theta$ we have $p(x, \theta) > 0$ for all $x \in N$ (see (1.1)), and therefore (2.17) implies that $f_1(T(x), \theta) > 0$ and $g_1(x) > 0$ for almost all $x \in N$. By virtue of this, if we now choose a particular point $\theta^0 \in \Theta$, which will be held fixed in its role for the rest of this proof, we then obtain from (2.17) that

$$\frac{p(x, \theta)}{p(x, \theta^0)} = \frac{f_1(T(x), \theta)}{f_1(T(x), \theta^0)} \quad \text{for almost all } x \in N. \tag{2.18}$$

Thus, for each $\theta$, $p(x, \theta)/p(x, \theta^0)$ is a function of $T(x)$ almost everywhere in $N$. We shall now show that, in fact, $p(x, \theta)/p(x, \theta^0)$ is a function of $T(x)$ everywhere in $N$.

For a given $\theta$, let $A$ be the subset of $N$ for each point of which (2.18) holds. Then $N - A$ is of Lebesgue measure 0. Let $x'$ be a point of $N - A$. Suppose $x''$ is another point of $N$ such that $T(x'') = T(x')$. We wish to show that the left-hand side of (2.18) has the same value for $x'$ and $x''$. We apply Lemma 2.2 to obtain sequences $\{x'^{(s)}\} \subset A$ and $\{x''^{(s)}\} \subset A$, converging properly to $x'$ and $x''$, respectively, and such that $T(x'^{(s)}) = T(x''^{(s)})$ for all $s = 1, 2, \ldots$. It follows then, from (2.18), that

$$\frac{p(x'^{(s)}, \theta)}{p(x'^{(s)}, \theta^0)} = \frac{p(x''^{(s)}, \theta)}{p(x''^{(s)}, \theta^0)}, \qquad s = 1, 2, \ldots. \tag{2.19}$$

From this it follows immediately, by virtue of the continuity of $p$, that

$$\frac{p(x', \theta)}{p(x', \theta^0)} = \frac{p(x'', \theta)}{p(x'', \theta^0)}. \tag{2.20}$$

And this is what was to be established.

Thus, $p(x, \theta)/p(x, \theta^0)$ is a function of $T(x)$ everywhere in $N$; and this is true for each $\theta \in \Theta$. Hence there is a function $f$ on $T(N) \times \Theta$ such that

$$\frac{p(x, \theta)}{p(x, \theta^0)} = f(T(x), \theta), \qquad x \in N,\ \theta \in \Theta. \tag{2.21}$$
Vol. 21] SANKHYĀ : THE INDIAN JOURNAL OF STATISTICS [Parts 3 & 4

If we define the function $g$ on $N$ by

$$g(x) = p(x, \theta^0), \qquad x \in N, \tag{2.22}$$

then we have

$$p(x, \theta) = f(T(x), \theta)\, g(x), \qquad x \in N,\ \theta \in \Theta. \tag{2.23}$$

It remains now to show that $f$ and $g$ have the continuous differentiability properties asserted by Lemma 2.3. This result for the function $g$ is immediate from (2.22), by virtue of the differentiability of $p$. We turn our attention to the function $f$.

Let $y' = (y'_1, y'_2, \ldots, y'_r)$ be any particular point of $T(N)$. Let $x' = (x'_1, x'_2, \ldots, x'_n)$ be a point of $N$ such that $T(x') = y'$. The matrix $J(x)$ is of rank $r$ at $x'$; let us suppose, for simplicity, that

$$\left.\frac{\partial(h_1, h_2, \ldots, h_r)}{\partial(x_1, x_2, \ldots, x_r)}\right|_{x = x'} \neq 0. \tag{2.24}$$

We now have the situation that we encountered in the proof of Lemma 2.1, and to which we applied the Implicit Function Theorem to obtain the continuously differentiable inverse, $(\varphi, U)$, of $(T, U)$. The function $(\varphi, U)$ is defined on $S' \times W'$, where $S'$ is a spherical neighbourhood in $E_y^r$, centered at $y'$, and $W'$ is a spherical neighbourhood in $E_x^{n-r}$, centered at $z' = (x'_{r+1}, x'_{r+2}, \ldots, x'_n)$. (Again as in the proof of Lemma 2.1, there is no need to consider a $W'$ in the case $r = n$.) For the present context, $S'$ and $W'$ may be considered chosen small enough so that $D' \subset N$. Then (2.21) gives us the following:

$$f(y, \theta) = \frac{p(\varphi(y, z), z, \theta)}{p(\varphi(y, z), z, \theta^0)}, \qquad y \in S',\ z \in W',\ \theta \in \Theta. \tag{2.25}$$

From this expression for $f(y, \theta)$ in $S' \times \Theta$, because of the continuous differentiability of $p$ and $\varphi$, we get immediately the asserted continuous differentiability of $f$ in $S' \times \Theta$. Since this result obtains for every particular point $y' \in T(N)$, we have established the asserted continuous differentiability of $f$ in $T(N) \times \Theta$.

Lemma 2.3 is now fully established.


3. THE LOCAL MINIMAL DIMENSION OF A SUFFICIENT STATISTIC 


The three theorems of this section constitute the basic results of this article; 
it is from them that we are able to deduce global results in the next section. We 
refer the reader to the Introduction for a descriptive outline of the sequence of con- 
clusions toward which we now begin to argue. 


For any particular point $x \in \Omega$, let $\{j_1, j_2, \ldots, j_n\}$ be any particular collection of $n$ integers, each chosen from the set $\{1, 2, \ldots, \nu\}$, and let $\{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n)}\}$ be any particular collection of points of $\Theta$ (not necessarily all distinct). We define the $n \times n$ matrix

$$L(x;\ j_1, j_2, \ldots, j_n;\ \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n)}) = \left\| \left(\frac{\partial^2 \log p}{\partial x_s\, \partial \theta_{j_k}}\right)_{(x,\, \theta^{(k)})} \right\|, \qquad s = 1, 2, \ldots, n;\ k = 1, 2, \ldots, n. \tag{3.1}$$

Let $\mathcal{L}_x$ denote the class of all matrices (3.1) for the given point $x$.

Definition 3.1: The integer-valued function $\rho_1$, on $\Omega$, is defined by

$$\rho_1(x) = \max_{L \in \mathcal{L}_x} (\operatorname{rank} L). \tag{3.2}$$

The symbol $S_{x^0, \alpha}$ will denote the open sphere in $\Omega$ centered at $x^0$ and of radius $\alpha$.

Definition 3.2: The integer-valued function $\rho$, on $\Omega$, is defined by

$$\rho(x^0) = \lim_{\alpha \to 0}\ \max_{x \in S_{x^0, \alpha}} \rho_1(x). \tag{3.3}$$

With these definitions we now prove

Theorem 3.1: If $T$ is a $\mathcal{P}$-sufficient statistic which is Euclidean of dimension $r$ at $x^0$ and regular at $x^0$, then

$$r \ge \rho(x^0). \tag{3.4}$$

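Proof: Let $T$ be given by (1.4) in a neighbourhood of $x^0$. By Lemma 2.3, there is a neighbourhood $N$ of $x^0$, and differentiable functions $f$ and $g$, as detailed in the Lemma, such that

$$p(x, \theta) = f(T(x), \theta)\, g(x), \qquad x \in N,\ \theta \in \Theta. \tag{3.5}$$

We may differentiate this to obtain

$$\frac{\partial^2 \log p}{\partial x_s\, \partial \theta_j} = \sum_{i=1}^{r} \left(\frac{\partial^2 \log f}{\partial y_i\, \partial \theta_j}\right)_{y = T(x)} \frac{\partial h_i}{\partial x_s}. \tag{3.6}$$

It follows that for any particular $x \in N$, and a set of integers $\{j_1, j_2, \ldots, j_n\}$ and a set of points $\{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n)}\}$ which define a matrix (3.1) for this $x$, we have the factorization

$$L(x;\ j_1, j_2, \ldots, j_n;\ \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n)}) = J'(x) \cdot \left\| \left(\frac{\partial^2 \log f}{\partial y_i\, \partial \theta_{j_k}}\right)_{(T(x),\, \theta^{(k)})} \right\|, \qquad i = 1, 2, \ldots, r;\ k = 1, 2, \ldots, n, \tag{3.7}$$

where $J'(x)$ denotes the transpose of $J(x)$.

The rank of the product of two matrices is not greater than the rank of either factor; the first factor on the right-hand side of (3.7) has rank $r$ (and the second factor has rank at most $r$). It follows that the rank of $L$ in (3.7) is $\le r$. Since this is true for every $L \in \mathcal{L}_x$, we have $\rho_1(x) \le r$.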
This result holds for each $x \in N$. Therefore, if $\alpha > 0$ is small enough so that $S_{x^0, \alpha} \subset N$, then we have $\max_{x \in S_{x^0, \alpha}} \rho_1(x) \le r$. Since, clearly, for any sphere $S_{x^0, \alpha}$ it is true that $\rho(x^0) \le \max_{x \in S_{x^0, \alpha}} \rho_1(x)$, we get, finally, that $\rho(x^0) \le r$. This is the asserted result (3.4), and the theorem is therefore established.
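
A small check of Theorem 3.1 may be useful. It uses an assumed example that does not come from the article: $x_1, \ldots, x_n$ independent $N(\theta, 1)$, with a one-dimensional parameter ($\nu = 1$).

```latex
\log p(x,\theta) = -\tfrac{n}{2}\log(2\pi) - \tfrac12\sum_{s=1}^{n}(x_s-\theta)^2,
\qquad
\frac{\partial^{2}\log p}{\partial x_s\,\partial\theta} = 1 \quad (s = 1,\ldots,n).
% Every matrix (3.1) is therefore the n x n matrix of ones, of rank 1, so
\rho_1(x) = \rho(x) = 1 \quad \text{for all } x \in \Omega .
```

Under this assumption the theorem says that a sufficient statistic which is Euclidean and regular at a point has dimension at least 1 there, and the statistic $T(x) = \sum_s x_s$, whose Jacobian $(1, \ldots, 1)$ has rank 1, attains that bound.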
Theorem 3.1 may now be strengthened, as follows.

Theorem 3.2: If $T$ is a $\mathcal{P}$-sufficient statistic which is Euclidean of dimension $r$ at $x^0$ and continuously differentiable in a neighbourhood of $x^0$ (but not necessarily such that the matrix $J(x^0)$ is of rank $r$), then

$$r \ge \rho(x^0). \tag{3.8}$$

Proof: Since $\rho_1$ is an integer-valued function (that is, its range is a discrete set), it follows that (1) for all $x$ in a sufficiently small sphere centered at $x^0$, we have $\rho_1(x) \le \rho(x^0)$, and (2) in every sphere centered at $x^0$, of however small radius, there is a point $x'$ with $\rho_1(x') = \rho(x^0)$. The fact (1) then implies that for the points $x'$ of (2) which lie in a sufficiently small sphere centered at $x^0$, we have $\rho(x') = \rho_1(x') = \rho(x^0)$. The continuity of the derivatives of $p$ implies, finally, that for such a point $x'$ sufficiently close to $x^0$, there is a neighbourhood of $x'$ at every point $x$ of which we have $\rho(x) = \rho(x') = \rho(x^0)$. These considerations show that in any neighbourhood of $x^0$, however small, there is an open subset, for every point $x$ of which we have $\rho(x) = \rho(x^0)$. (See the proof of Lemma 3.1 below for complete details on these arguments.)

Let $T$ be given by (1.4) in a neighbourhood $N$ of $x^0$. Let $N' \subset N$ be an open set (existing according to the preceding paragraph) such that

$$\rho(x) = \rho(x^0), \qquad x \in N'. \tag{3.9}$$

Define

$$r' = \max_{x \in N'}\, [\operatorname{rank} J(x)], \tag{3.10}$$

and let $x'$ be a particular point of $N'$ such that

$$\operatorname{rank} J(x') = r'. \tag{3.11}$$

We now have the situation that $J(x)$ is of rank $r'$ at $x'$ (by (3.11)), and for each $x \in N'$ (a neighbourhood of $x'$), every minor of $J(x)$ of order $> r'$ vanishes (by (3.10)). It follows, by a familiar course of reasoning from the Implicit Function Theorem, that there is a neighbourhood $N''$ (a subset of $N'$) of $x'$, and some specific subset of $r'$ of the functions $h_1, h_2, \ldots, h_r$, such that the Jacobian matrix of these $r'$ functions is of rank $r'$ throughout $N''$ and such that $T$ is a continuously differentiable function of only these $r'$ functions in $N''$. If the $r'$ functions in question are $h_{i_1}, h_{i_2}, \ldots, h_{i_{r'}}$, then define $T'$ as follows:

$$T'(x) = \begin{cases} (h_{i_1}(x), h_{i_2}(x), \ldots, h_{i_{r'}}(x)), & x \in N'', \\ T(x), & x \in \Omega - N''. \end{cases} \tag{3.12}$$
Clearly, $T'$ is Euclidean of dimension $r'$ at $x'$ and regular at $x'$. Moreover, it is a ready consequence of Lemma 1.1 that, since $T$ is a sufficient statistic, so also is $T'$. Hence, by Theorem 3.1,

$$r' \ge \rho(x'). \tag{3.13}$$

Now, finally, we have $r \ge r'$ and, by (3.9), since $x' \in N'$, $\rho(x') = \rho(x^0)$. Applying (3.13) to these two facts we get (3.8), and the proof of Theorem 3.2 is complete.

Having thus established that the function $\rho$ provides a lower bound for the local dimension of a locally continuously differentiable, Euclidean sufficient statistic, we turn now to the task of showing that this lower bound is attained at points of $\Omega$ of a certain kind. The following definition characterizes the kind of point in question.
Definition 3.3: A point $x \in \Omega$ is said to be a regular point of $\Omega$ (for the family $\mathcal{P}$) if

$$\rho(x) = \rho_1(x). \tag{3.14}$$

In order to have an idea of the incidence of regular points in $\Omega$, before going on to derive results concerning them, we prove the following lemma.

Lemma 3.1: The set $R$, of regular points of $\Omega$, is an everywhere dense, open subset of $\Omega$. (Equivalently, the set $\Omega - R$, of nonregular points of $\Omega$, is a nowhere dense subset of $\Omega$, closed in $\Omega$.)

The function $\rho$ is continuous on $R$.

Proof: Let $N$ be any open subset of $\Omega$. We shall show that $N$ contains a regular point, thereby proving that $R$ is everywhere dense. Let $x^0$ be a particular point of $N$. If every sphere $S_{x^0, \alpha}$ contained a point $x$ with $\rho_1(x) > r$ (some particular integer), then, by Definition 3.2, we should have $\rho(x^0) \ge r + 1$. In particular, on taking $r = \rho(x^0)$, we should have the self-contradictory assertion that $\rho(x^0) \ge \rho(x^0) + 1$; and therefore it follows that there is a sphere $S_{x^0, \alpha_1} \subset N$ such that

$$\rho_1(x) \le \rho(x^0), \qquad x \in S_{x^0, \alpha_1}. \tag{3.15}$$

If the strict inequality held in (3.15) for all $x \in S_{x^0, \alpha_1}$, then it would likewise hold for all spheres $S_{x^0, \alpha}$ with $\alpha < \alpha_1$, and therefore we should have $\max_{x \in S_{x^0, \alpha}} \rho_1(x) \le \rho(x^0) - 1$ for $\alpha < \alpha_1$, and hence $\rho(x^0) = \lim_{\alpha \to 0} \max_{x \in S_{x^0, \alpha}} \rho_1(x) \le \rho(x^0) - 1$, which is again a self-contradiction. Thus, there exists a point $x' \in S_{x^0, \alpha_1}$ such that

$$\rho_1(x') = \rho(x^0). \tag{3.16}$$
Now, taking (3.15) in particular for $x = x^0$, and realizing that $N$ could be chosen suitably for any prescribed point $x^0$, we see that we have

$$\rho_1(x) \le \rho(x), \qquad x \in \Omega. \tag{3.17}$$

Applying (3.17) in particular to the point $x'$, we have

$$\rho_1(x') \le \rho(x'). \tag{3.18}$$

Let $S_{x', \beta} \subset S_{x^0, \alpha_1}$. Then by (3.15), $\max_{x \in S_{x', \beta}} \rho_1(x) \le \rho(x^0)$, and therefore, by Definition 3.2,

$$\rho(x') \le \rho(x^0). \tag{3.19}$$

Combining (3.16), (3.18) and (3.19) gives

$$\rho_1(x') = \rho(x') = \rho(x^0). \tag{3.20}$$

Hence, in particular, $x'$ is a regular point. And since $x' \in S_{x^0, \alpha_1} \subset N$, we have hereby proved the everywhere denseness of $R$ in $\Omega$.

To complete the proof of the lemma, let $x'$ be any regular point. By the same argument that led to (3.15) above for the point $x^0$, we have in the present case that there is a sphere $S_{x', \alpha_1}$ such that

$$\rho_1(x) \le \rho(x'), \qquad x \in S_{x', \alpha_1}. \tag{3.21}$$

On the other hand, there is a minor of order $\rho(x')$, of some matrix (3.1), which is nonvanishing at $x'$; since the partial derivatives constituting this minor are continuous, there is in fact a sphere $S_{x', \alpha_2}$ at every point of which this minor is nonvanishing, and therefore

$$\rho_1(x) \ge \rho(x'), \qquad x \in S_{x', \alpha_2}. \tag{3.22}$$

If $S_{x', \alpha}$ denotes the smaller of the two spheres $S_{x', \alpha_1}$ and $S_{x', \alpha_2}$, then combining (3.21) and (3.22) with the fact that $\rho(x') = \rho_1(x')$ (since $x'$ is regular), we get

$$\rho_1(x) = \rho(x'), \qquad x \in S_{x', \alpha}. \tag{3.23}$$

Thus, in particular, $\rho_1$ is constant in $S_{x', \alpha}$. It is readily verified with Definition 3.2 that in this circumstance the function $\rho$ is likewise constant in $S_{x', \alpha}$, and with the same value as $\rho_1$. Thus, we have

$$\rho(x) = \rho_1(x) = \rho(x'), \qquad x \in S_{x', \alpha}. \tag{3.24}$$
This shows immediately both (1) that all points of $S_{x', \alpha}$ are regular points, so that $R$ is an open set, and (2) that $\lim_{x \to x'} \rho(x) = \rho(x')$, so that $\rho$ is continuous on $R$.

This completes the proof of Lemma 3.1.

For completeness we include the following alternative characterization of the regular points of $\Omega$.

Lemma 3.2: $R$ is precisely the set of points of continuity of the function $\rho_1$.

Proof: Since by the preceding lemma, $\rho$ is continuous on the open set $R$, and $\rho_1 = \rho$ on $R$ (by definition), it follows that every point of $R$ is a continuity point of $\rho_1$.

Conversely, suppose $x^0$ is a continuity point of $\rho_1$. Since $\rho_1$ is integer-valued (thus, with discrete range), there is then a sphere $S_{x^0, \alpha}$ for every point $x$ of which $\rho_1(x) = \rho_1(x^0)$. It therefore follows immediately from Definition 3.2 that $\rho(x^0) = \rho_1(x^0)$, that is, $x^0 \in R$. This completes the proof.


Conversely, 


We are now ready to prove

Theorem 3.3: If $x^0$ is a regular point of $\Omega$, there exists an $\mathcal{M}$-sufficient statistic, $T$, which is Euclidean of dimension $\rho(x^0)$ at $x^0$, and regular at $x^0$.

Proof: If $\rho(x^0) = n$, the result follows immediately by taking $T(x) \equiv x$. We go on to consider the case $\rho(x^0) < n$.

For the regular point $x^0$, let us set, for brevity, $r_0 = \rho(x^0) = \rho_1(x^0)$. By Definition 3.1, there is a matrix (3.1) of which a particular $r_0 \times r_0$ submatrix is nonsingular at $x^0$. Without loss of generality we may suppose that the $r_0 \times r_0$ submatrix in question involves the first $r_0$ coordinates of $x$. Thus, there exist points $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(r_0)}$ in $\Theta$, and integers $j_1, j_2, \ldots, j_{r_0}$, each chosen from $\{1, 2, \ldots, \nu\}$, such that the matrix

$\Lambda(x) = \left( \dfrac{\partial^2 \log p}{\partial x_i\,\partial\theta_{j_k}} \Big|_{x,\theta^{(k)}} \right)$, $i, k = 1, 2, \ldots, r_0$, ... (3.25)

is nonsingular at $x^0$. By virtue of the continuous differentiability of $p$, there is then a neighbourhood of $x^0$, at every point of which $\Lambda(x)$ is nonsingular; choose such a spherical neighbourhood, say $S_{x^0,\alpha}$. But furthermore, since $x^0$ is a regular point, it follows by Lemma 3.1 that we may choose $S_{x^0,\alpha}$ so small that

$\rho(x) = \rho_1(x) = \rho(x^0)$, $x \in S_{x^0,\alpha}$; ... (3.26)

that is, every point of $S_{x^0,\alpha}$ is a regular point and $\rho$ is constant in $S_{x^0,\alpha}$. So choosing $S_{x^0,\alpha}$, we therefore have the following two facts: (1) $\Lambda(x)$ has nonvanishing determinant for all $x \in S_{x^0,\alpha}$, and (2) for any point $\theta \in \Theta$, any integer $j = 1, 2, \ldots, \nu$ and any integer $s = 1, 2, \ldots, n$, we have

$\begin{vmatrix} & & & \dfrac{\partial^2 \log p}{\partial x_1\,\partial\theta_j}\Big|_{x,\theta} \\ & \Lambda(x) & & \vdots \\ & & & \dfrac{\partial^2 \log p}{\partial x_{r_0}\,\partial\theta_j}\Big|_{x,\theta} \\ \dfrac{\partial^2 \log p}{\partial x_s\,\partial\theta_{j_1}}\Big|_{x,\theta^{(1)}} & \cdots & \dfrac{\partial^2 \log p}{\partial x_s\,\partial\theta_{j_{r_0}}}\Big|_{x,\theta^{(r_0)}} & \dfrac{\partial^2 \log p}{\partial x_s\,\partial\theta_j}\Big|_{x,\theta} \end{vmatrix} = 0$. ... (3.27)


Now, it follows from these facts, by a familiar argument with the Implicit Function Theorem, that there is a neighbourhood $N \subset S_{x^0,\alpha}$ of $x^0$ such that, for each $\theta \in \Theta$, the functions

$\dfrac{\partial \log p}{\partial\theta_j}\Big|_{x,\theta}$, $j = 1, 2, \ldots, \nu$, ... (3.28)

are, in $N$, continuously differentiable functions of only the functions $\eta_k$, defined by

$\eta_k(x) = \dfrac{\partial \log p}{\partial\theta_{j_k}}\Big|_{x,\theta^{(k)}}$, $k = 1, 2, \ldots, r_0$, ... (3.29)

the neighbourhood $N$ being such that its image under $\eta = (\eta_1, \eta_2, \ldots, \eta_{r_0})$ is an open sphere $S$ in an $r_0$-dimensional Euclidean space $E^{r_0}$. In other words, there are functions $F_j$ on $S \times \Theta$, which are continuously differentiable in the coordinates $y_1, y_2, \ldots, y_{r_0}$ of a point $y \in S$, such that

$\dfrac{\partial \log p}{\partial\theta_j}\Big|_{x,\theta} = F_j(\eta(x), \theta)$, $x \in N$, $\theta \in \Theta$, $j = 1, 2, \ldots, \nu$. ... (3.30)

It is a consequence of the relations (3.30) that there exist continuous functions $F$, on $S \times \Theta$, and $G$, on $N$, such that

$\log p(x, \theta) = F(\eta(x), \theta) + G(x)$, $(x, \theta) \in N \times \Theta$, ... (3.31)
and such that the derivatives

$\dfrac{\partial F}{\partial y_k}$, $k = 1, 2, \ldots, r_0$;
$\dfrac{\partial F}{\partial\theta_j}$, $j = 1, 2, \ldots, \nu$;
$\dfrac{\partial^2 F}{\partial y_k\,\partial\theta_j}$, $\dfrac{\partial^2 F}{\partial\theta_j\,\partial y_k}$, $k = 1, 2, \ldots, r_0$; $j = 1, 2, \ldots, \nu$; ... (3.32)
$\dfrac{\partial G}{\partial x_i}$, $i = 1, 2, \ldots, n$,

are all continuous in their respective domains $S \times \Theta$ and $N$. A sketch will suffice to show how this result may be established. From (3.30) for $j = 1$, it follows that

$\log p(x, \theta) = \hat F_1(\eta(x), \theta) + G_1(x; \theta_2, \theta_3, \ldots, \theta_\nu)$, ... (3.33)

where $\hat F_1$ is an integral, with respect to $\theta_1$, of $F_1$, and so has the continuous differentiability properties described for $F$ in (3.31) and (3.32). Since $\hat F_1$ has these properties, and $\eta$ is continuously differentiable, and since $\log p$ also has the continuous differentiability properties, it is implied by (3.33) that $G_1$ likewise has these properties. Differentiating (3.33) with respect to $\theta_2$ and equating the result to the right-hand side of (3.30) for $j = 2$, we obtain

$\dfrac{\partial G_1}{\partial\theta_2} = F_2 - \dfrac{\partial \hat F_1}{\partial\theta_2}$. ... (3.34)

The right-hand side of this equation depends on $x$ only through $\eta$, and is, moreover, independent of $\theta_1$, since the left-hand side is. Therefore, (3.34) asserts that

$G_1(x; \theta_2, \theta_3, \ldots, \theta_\nu) = \hat F_2(\eta(x); \theta_2, \theta_3, \ldots, \theta_\nu) + G_2(x; \theta_3, \theta_4, \ldots, \theta_\nu)$, ... (3.35)

where $\hat F_2$ and $G_2$ again have the stated properties of continuous differentiability. Inserting the right-hand side of (3.35) into (3.33), and setting $\tilde F_2 = \hat F_1 + \hat F_2$, we obtain

$\log p(x, \theta) = \tilde F_2(\eta(x), \theta) + G_2(x; \theta_3, \theta_4, \ldots, \theta_\nu)$. ... (3.36)

Continuing in the manner indicated through all the conditions (3.30), we ultimately obtain (3.31) and (3.32).
Define the function $T$ on $\Omega$ as follows:

$T(x) = \begin{cases} \eta(x), & x \in N, \\ x, & x \in \Omega - N. \end{cases}$ ... (3.37)

$\mathcal{T}$, the range of $T$, is $S \cup (\Omega - N)$. On $\mathcal{T} \times \Theta$ define the function $f$ as follows:

$f(u, \theta) = \begin{cases} e^{F(u,\theta)}, & (u, \theta) \in S \times \Theta, \\ p(u, \theta), & (u, \theta) \in (\Omega - N) \times \Theta, \end{cases}$ ... (3.38)

and define the function $g$ on $\Omega$ by

$g(x) = \begin{cases} e^{G(x)}, & x \in N, \\ 1, & x \in \Omega - N. \end{cases}$ ... (3.39)

Then, by virtue of (3.31), we have

$p(x, \theta) = f(T(x), \theta)\, g(x)$, $(x, \theta) \in \Omega \times \Theta$. ... (3.40)

This relation will verify the sufficiency of $T$ as soon as we demonstrate the requisite measurability properties of $f$ and $g$.

Regarding $f$, we must show, for each fixed $\theta$, that the inverse image, under $f(T(\cdot), \theta)$, of a Borel set on the real line is a Lebesgue set in $\Omega$. Let $H$ be a Borel set on the real line, and define the sets (for a particular, fixed $\theta$)

$A_1 = \{u \in S \mid e^{F(u,\theta)} \in H\}$,
$A_2 = \{u \in \Omega - N \mid p(u, \theta) \in H\}$. ... (3.41)

Then,

inverse image of $H$ under $f(T(\cdot), \theta) = T^{-1}(A_1 \cup A_2) = \eta^{-1}(A_1) \cup A_2$. ... (3.42)

Since $e^{F(\cdot,\theta)}$ and $p(\cdot, \theta)$ are continuous functions on $S$ and $\Omega - N$, respectively, the sets $A_1$ and $A_2$ are Borel sets. Also, $\eta$ is a continuous function, and therefore $\eta^{-1}(A_1)$ is a Borel set. Hence, the union of $\eta^{-1}(A_1)$ and $A_2$ is a Borel set in $\Omega$, and we have, by virtue of (3.42), proved the Lebesgue (in fact, the Borel) measurability of $f(T(\cdot), \theta)$.

From the Lebesgue measurability of $p(\cdot, \theta)$ and $f(T(\cdot), \theta)$ in (3.40), it follows that $g$ is Lebesgue measurable. Hence, Lemma 1.1 asserts that $T$ is a sufficient statistic. In the neighbourhood $N$ of $x^0$, $T = \eta$; and $\eta$ is continuously differentiable on $N$ into $E^{r_0}$, and its Jacobian matrix is of rank $r_0$ at $x^0$ (the matrix $\Lambda(x)$ (see (3.25)), which is nonsingular at $x^0$, is an $r_0 \times r_0$ submatrix of the Jacobian matrix of $\eta$). Thus, $T$ is Euclidean of dimension $\rho(x^0)$ at $x^0$, and regular at $x^0$.

Theorem 3.3 is therefore fully proved.

We remark that (3.29) constitutes an explicit characterization of the function $\eta$ with which we have constructed the sufficient statistic $T$ in (3.37). That is, our proof is explicitly constructive.
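The constructive character of (3.29) can be illustrated numerically. The following is only a minimal sketch under stated assumptions: the family (a single i.i.d. normal sample with mean and standard deviation as the two parameters), the helper names `log_density` and `eta`, and the use of central differences are all illustrative choices and are not taken from the paper.

```python
import math

def log_density(x, theta):
    """Joint log density of i.i.d. normal observations.

    theta = (mean, standard deviation); this family is an assumed example.
    """
    mean, sd = theta
    return sum(
        -math.log(sd * math.sqrt(2.0 * math.pi)) - (xi - mean) ** 2 / (2.0 * sd ** 2)
        for xi in x
    )

def eta(x, theta_points, js, h=1e-5):
    """Approximate eta_k(x) = d log p / d theta_{j_k} at theta^{(k)}, as in (3.29).

    theta_points[k] plays the role of theta^{(k)}; js[k] is the (0-based)
    index j_k. Derivatives are taken by central differences with step h.
    """
    values = []
    for theta, j in zip(theta_points, js):
        up, down = list(theta), list(theta)
        up[j] += h
        down[j] -= h
        values.append((log_density(x, up) - log_density(x, down)) / (2.0 * h))
    return values

x = [0.3, -1.2, 2.0]
vals = eta(x, [(0.0, 1.0), (0.0, 1.0)], [0, 1])
# Up to additive constants these are sum(x) and sum(x_i^2).
print(vals)
```

With both $\theta^{(k)}$ taken at mean 0 and standard deviation 1, the two components agree, up to the additive constant $-n$ in the second, with the familiar pair $(\sum x_i, \sum x_i^2)$.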
4. GLOBAL ACHIEVEMENT OF MINIMAL DIMENSIONALITY

Theorem 3.3 gives rise to the question: for how "big" a subset of $R$ is it possible to achieve, in a single sufficient statistic, the local minimal dimension, $\rho(x)$, at all points $x$ of the subset? The next theorem provides an answer.

Theorem 4.1: There exists an $\mathcal{M}$-sufficient statistic, $T^*$, which, for almost all (Lebesgue) regular points $x^0$ of $\Omega$, is Euclidean of dimension $\rho(x^0)$ at $x^0$, and regular at $x^0$.

Proof: The key to this result lies in the observation of what has actually been established in the proof of the preceding theorem. It will be seen, on re-examining the details, that, in fact, the sufficient statistic $T$ which we constructed has the following property: every point $x$ of the neighbourhood $N$ is a regular point, and for each $x \in N$, $T$ is Euclidean of dimension $\rho(x)$ at $x$, and regular at $x$. We shall employ the full force of this to prove the present theorem.

The set $R$ is an open set, and it is therefore a denumerable union of mutually disjoint cells:

$R = \bigcup_{s=1}^{\infty} I_s$. ... (4.1)

(A cell is a bounded interval with the upper faces included and the lower faces excluded.) Moreover, we may take such a representation (4.1) that, for each cell $I_s$, its closure, $\bar I_s$, is a subset of $R$.

Now our observation above, concerning the full implication of the proof of Theorem 3.3, enables us here to assert the following, for each $s = 1, 2, \ldots$: for each point $x^0 \in \bar I_s$, there is an open interval $K_{s,x^0}$, centered at $x^0$, lying within $R$, and a sufficient statistic $T_{s,x^0}$ which is Euclidean of dimension $\rho(x)$ at each point $x \in K_{s,x^0}$, and regular at each point $x \in K_{s,x^0}$. By the Heine-Borel Theorem, there is a finite subcollection of the collection $\{K_{s,x^0},\ x^0 \in \bar I_s\}$ which covers $\bar I_s$. Let this finite covering subcollection be $\{K'_{s1}, K'_{s2}, \ldots, K'_{s m_s}\}$, and the corresponding $T_{s,x^0}$ be $T'_{s1}, T'_{s2}, \ldots, T'_{s m_s}$. Then, by a well-known procedure, using the bounding hyperplanes of the $K'_{sj}$ in which to define new faces where necessary, we are able to designate a finite collection of mutually disjoint cells, $\{K_{s1}, K_{s2}, \ldots, K_{s t_s}\}$, each of which is a subset of some $K'_{sj}$, and such that

$\bigcup_{i=1}^{t_s} K_{si} = I_s$. ... (4.2)

For each $i = 1, 2, \ldots, t_s$, let $T_{si}$ be some particular one of those $T'_{sj}$ such that $K_{si} \subset K'_{sj}$. Then the sufficient statistic $T_{si}$ is Euclidean of dimension $\rho(x)$ at each $x \in K^0_{si}$ = interior of $K_{si}$, and is regular at each $x \in K^0_{si}$.
Consider the above carried out for all $s = 1, 2, \ldots$. Let the functions $f_{si}$ and $g_{si}$ (with the continuous differentiability properties found in the proof of Theorem 3.3) be such that

$p(x, \theta) = f_{si}(T_{si}(x), \theta)\, g_{si}(x)$, $(x, \theta) \in K_{si} \times \Theta$; $i = 1, 2, \ldots, t_s$; $s = 1, 2, \ldots$. ... (4.3)

In particular, for each pair $s, i$ such that the (constant) value of $\rho$ on $K_{si}$ is $n$, we take $T_{si}(x) = x$, $f_{si}(x, \theta) = p(x, \theta)$ and $g_{si}(x) = 1$. (This is in agreement with our definition of $T$ for $\rho(x^0) = n$ in the proof of Theorem 3.3.)

For each of the integers $r = 1, 2, \ldots, n-1$ in the range of the function $\rho$ on $R$, let $E^r$ be a fixed $r$-dimensional Euclidean space. For every index pair $s, i$ such that $T_{si}$ is Euclidean of dimension $r$, we shall consider $D_{si}$, the range of $K_{si}$ under $T_{si}$, as a set in $E^r$. Since our present statistics $T_{si}$ are of the kind constructed in the proof of Theorem 3.3 we have that the sets $D_{si}$ are all bounded. It follows, then, by virtue of this boundedness and the denumerability of the collection of the $D_{si}$, that there exist translation vectors $v_{si}$ in the respective imbedding spaces $E^r$, $r = 1, 2, \ldots, n-1$, of the $D_{si}$ of positive dimension $< n$ such that the following holds: if $D^*_{si}$ denotes the translation of $D_{si}$ by $v_{si}$, then for any two sets $D_{si}$ and $D_{s'i'}$ in the same $E^r$, the sets $D^*_{si}$ and $D^*_{s'i'}$ are disjoint.
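The existence of such translations is elementary; for sets on the line ($r = 1$) it can be sketched as follows. This is only an illustration under stated assumptions: each bounded set is replaced by a bounding interval `(lo, hi)`, and the function name `disjoint_shifts` and the `gap` parameter are hypothetical, not notation from the paper.

```python
def disjoint_shifts(intervals, gap=1.0):
    """Return shifts v_k such that the translated intervals are pairwise disjoint.

    intervals: sequence of bounded intervals (lo, hi) with lo <= hi.
    The k-th interval is moved so that it starts where the previous
    translated interval ended, plus a positive gap. Because each interval is
    bounded, the same rule can be continued through a denumerable sequence.
    """
    shifts = []
    cursor = 0.0
    for lo, hi in intervals:
        shifts.append(cursor - lo)      # translated interval starts at cursor
        cursor += (hi - lo) + gap       # next one starts past its right end
    return shifts

intervals = [(0.0, 2.0), (1.0, 3.0), (-5.0, 5.0)]
shifts = disjoint_shifts(intervals)
translated = [(lo + v, hi + v) for (lo, hi), v in zip(intervals, shifts)]
print(translated)
```

Boundedness is what makes the rule work: an unbounded set could not be followed by another one placed "past its right end".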


The cases in which $T_{si}$ is Euclidean of dimension 0 are those in which the $T_{si}$ are constant. The $D_{si}$ for such a $T_{si}$ is (a set consisting of) a single number. For these cases we can likewise define numbers $v_{si}$ such that no two of the numbers $D^*_{si} = D_{si} + v_{si}$ are equal.

With the $v_{si}$ defined as here described for the $T_{si}$ of dimensions $0, 1, 2, \ldots, n-1$, and with $v_{si}$ = null vector in $E^n$ for a $T_{si}$ of dimension $n$, we define

$T^*_{si}(x) = T_{si}(x) + v_{si}$, $x \in K_{si}$; $i = 1, 2, \ldots, t_s$; $s = 1, 2, \ldots$, ... (4.4)

and

$f^*_{si}(u, \theta) = f_{si}(u - v_{si}, \theta)$, $(u, \theta) \in D^*_{si} \times \Theta$; $i = 1, 2, \ldots, t_s$; $s = 1, 2, \ldots$. ... (4.5)
We are now finally able to make the crucial definitions:

$T^*(x) = \begin{cases} T^*_{si}(x), & x \in K_{si};\ i = 1, 2, \ldots, t_s;\ s = 1, 2, \ldots, \\ x, & x \in \Omega - R, \end{cases}$ ... (4.6)

$f^*(u, \theta) = \begin{cases} f^*_{si}(u, \theta), & (u, \theta) \in D^*_{si} \times \Theta;\ i = 1, 2, \ldots, t_s;\ s = 1, 2, \ldots, \\ p(u, \theta), & (u, \theta) \in (\Omega - R) \times \Theta, \end{cases}$ ... (4.7)

and

$g^*(x) = \begin{cases} g_{si}(x), & x \in K_{si};\ i = 1, 2, \ldots, t_s;\ s = 1, 2, \ldots, \\ 1, & x \in \Omega - R. \end{cases}$ ... (4.8)

The first of these definitions specifies a function $T^*$ over all of $\Omega$ since $R = \bigcup_{s,i} K_{si}$. The range of $T^*$, say $\mathcal{T}^*$, is clearly $(\bigcup_{s,i} D^*_{si}) \cup (\Omega - R)$, so that $f^*$ is a function defined on $\mathcal{T}^* \times \Theta$. It is for purposes of definition of this function $f^*$ that it has been necessary to introduce the disjoint translations of the sets $D_{si}$. We could not directly define, for all $i$ and $s$, $f^*(u, \theta) = f_{si}(u, \theta)$ for $u \in D_{si}$, $\theta \in \Theta$, because if two sets $D_{si}$ and $D_{s'i'}$ have a nonempty intersection, there is no assurance that the functions $f_{si}$ and $f_{s'i'}$ agree on this intersection. We may remark that it was not necessary to translate those $D_{si}$ for which $T_{si}$ is of dimension $n$, since in these cases we have taken $T_{si}$ = identity function on $\Omega$ and therefore $D_{si} = K_{si}$, and the $K_{si}$ are already mutually disjoint and disjoint from $\Omega - R$.

It is an immediate consequence of (4.3)-(4.8) that

$p(x, \theta) = f^*(T^*(x), \theta)\, g^*(x)$, $(x, \theta) \in \Omega \times \Theta$. ... (4.9)

We need not carry out explicitly the demonstration of the Lebesgue (in fact, Borel) measurability of the functions $f^*(T^*(\cdot), \theta)$ and $g^*$. The proof runs along the same lines as the proof of measurability of the function $f(T(\cdot), \theta)$ in (3.40). In the present case there will be, in place of the single Borel set $\eta^{-1}(A_1)$ in (3.42), a denumerable union of such Borel sets. And one will have to employ here the fact that translations are Borel measurable functions.

Hence, (4.9) fulfills the conditions of Lemma 1.1, and $T^*$ is a sufficient statistic. Moreover, for $x \in \bigcup_{s,i} K^0_{si}$, $T^*$ is Euclidean of dimension $\rho(x)$ at $x$, and regular at $x$ (obviously, this property of the $T_{si}$ in $K^0_{si}$ is preserved under the translations (4.4)). It remains, then, only to observe that the complement in $R = \bigcup_{s,i} K_{si}$ of the set $\bigcup_{s,i} K^0_{si}$ is a union of subsets of denumerably many hyperplanes (that is, a union of faces of denumerably many cells), and so is of $n$-dimensional Lebesgue measure 0.

This completes the proof of Theorem 4.1.
The theorem we have just proved concerns global achievement of the local minimal dimensions of a sufficient statistic. It does not deal with a globally defined minimal dimension. On the other hand, it is customary practice in the literature to refer to a global minimal dimension. Specifically, reference is to "the smallest number of continuously differentiable, real-valued functions on $\Omega$ which can constitute a sufficient statistic." This particular definition of global minimal dimension is certainly more strict than it need be, in requiring continuous differentiability throughout $\Omega$. A more useful definition, in the light of our results, would be the following one:

Definition 4.1: The (almost sure) global minimal dimension of an $\mathcal{M}$-sufficient statistic, to be denoted by $d_1$, is the smallest integer $r$ such that the following is true: there exists an $\mathcal{M}$-sufficient statistic $T$ and an open set $A \subset \Omega$, with $\Omega - A$ of Lebesgue measure 0, such that $T$ is Euclidean of dimension $r$ at every point of $A$, and is continuously differentiable throughout $A$.

The following theorem now expresses the implications of our above results regarding the number $d_1$ and the construction of a sufficient statistic of global minimal dimension.

Theorem 4.2: The global minimal dimension $d_1$ satisfies the inequality

$d_1 \ge \max_{x \in R} \rho(x)$. ... (4.10)

There exists an $\mathcal{M}$-sufficient statistic, $T^+$, which is Euclidean of dimension $\max_{x \in R} \rho(x)$ at every point of an open set $A \subset R$, with $R - A$ of Lebesgue measure 0, and which is continuously differentiable throughout $A$.

Hence, if $\Omega - R$ is of Lebesgue measure 0, then equality holds in (4.10), and $T^+$ is an $\mathcal{M}$-sufficient statistic of global minimal dimension.

Proof: The inequality (4.10) is an immediate consequence of Theorem 3.2.
A statistic $T^+$ of the kind asserted in the second statement in the theorem can be obtained by a simple modification of the statistic $T^*$ constructed in the proof of Theorem 4.1. And, like $T^*$, this new statistic will be Euclidean and continuously differentiable on $\bigcup_{s,i} K^0_{si}$, which is a subset of $R$ whose complement in $R$ is of Lebesgue measure 0.

Since the idea underlying the construction of $T^+$ out of $T^*$ is quite simple, whereas the details are cumbersome, we shall content ourselves with merely indicating how it is to be done. For brevity, set $r_1 = \max_{x \in R} \rho(x)$. If $T^*_{si}$ is of dimension $r < r_1$ over $K_{si}$, write in more detail,

$T^*_{si}(x) = (h_{si,1}(x), h_{si,2}(x), \ldots, h_{si,r}(x))$, ... (4.11)

where the $h_{si,k}$ are the continuously differentiable component functions of $T^*_{si}$. Let $c_{si,1}, c_{si,2}, \ldots, c_{si,r_1-r}$ be $r_1 - r$ constants, and define

$T'_{si}(x) = (h_{si,1}(x), h_{si,2}(x), \ldots, h_{si,r}(x), c_{si,1}, c_{si,2}, \ldots, c_{si,r_1-r})$, $x \in K_{si}$. ... (4.12)

Thus, by this device, the range of each $T'_{si}$ on its $K_{si}$ is $r_1$-dimensional, so we may view all these ranges as lying in a common Euclidean space $E^{r_1}$. Following this up with the device of translations, as in the proof of the preceding theorem (and, in the case $r_1 = n$, possibly also contractions toward points, if they are needed), we are able to arrive at functions $T^+_{si}$ on the $K_{si}$, which have mutually disjoint ranges, and ranges which are, in the case $r_1 = n$, disjoint from $\Omega - R$. We then define

$T^+(x) = \begin{cases} T^+_{si}(x), & x \in K_{si};\ i = 1, 2, \ldots, t_s;\ s = 1, 2, \ldots, \\ x, & x \in \Omega - R, \end{cases}$ ... (4.13)

and we are enabled also, by the disjointness of the ranges of the $T^+_{si}$ on the $K_{si}$, to define, as in the preceding theorem, functions $f^+$ and $g^+$ which verify that $T^+$ is a sufficient statistic. A $T^+_{si}$, being formed from $T'_{si}$ by a translation and (possibly) a contraction, is of the form $aT'_{si} + b$ for some constants $a$ and $b$. Hence, it is clear that $T^+_{si}$ is continuously differentiable in $K^0_{si}$. And therefore, since, also, each $T^+_{si}$ is Euclidean of dimension $r_1$ in $K^0_{si}$, we have, by (4.13), that $T^+$ is Euclidean of dimension $r_1$ at each point of $\bigcup_{s,i} K^0_{si}$, and continuously differentiable throughout this set.

The last assertion in the statement of Theorem 4.2 is obvious, and the proof of the theorem is therefore complete.


Finally, we say a word about the relation between dimensional minimality of $\mathcal{M}$-sufficient statistics and functional minimality of such statistics. Lehmann and Scheffé (1950) examined the notion which they termed "minimality" of a sufficient statistic; we shall here designate this as "functional minimality." A sufficient statistic for our family $\mathcal{M}$ is a functionally minimal one if it is almost everywhere (Lebesgue) in $\Omega$ a function of any other sufficient statistic. The authors mentioned have shown, in their Theorem 6.3, that for our present family there does exist a functionally minimal sufficient statistic. The question therefore arises: how far toward functional minimality can we arrive by choosing a dimensionally minimal (local or global) sufficient statistic? An answer that we are able to give fairly readily is expressed by the following theorem.

Theorem 4.3: The $\mathcal{M}$-sufficient statistics $T^*$ and $T^+$, of the preceding two theorems, are locally functionally minimal at almost all (Lebesgue) points of $R$. That is, for each point $x$ in a set $A \subset R$, with $R - A$ of measure 0, there is a neighbourhood $N$ of $x$ such that if $T$ is any $\mathcal{M}$-sufficient statistic, then $T^*$ and $T^+$ are functions of $T$ in $N$.

Proof: The set $A$ in question may, again, be taken to be $\bigcup_{s,i} K^0_{si}$. In fact, what is true is that (referring to the elements in the proof of Theorem 4.1) for each pair $s, i$, the statistic $T_{si}$ is, on $K^0_{si}$, a function of any other sufficient statistic. Once this is established, the asserted local functional minimality of $T^*$ then follows from (4.4) and (4.6).
To establish our assertion concerning $T_{si}$ we note that in $K_{si}$ this statistic has, for its component functions, a set of functions of the form (3.29). Thus, we have to show that each of the functions

$\dfrac{\partial \log p}{\partial\theta_j}\Big|_{x,\theta^0}$ ... (4.14)

in question is a function of any sufficient statistic. Let $T$ be any particular sufficient statistic. Then, by Lemma 1.1, we have the representation (1.6) for some functions $f$ and $g$. It follows that

$\dfrac{\partial \log p}{\partial\theta_j}\Big|_{x,\theta^0} = \dfrac{\partial \log f(T(x), \theta)}{\partial\theta_j}\Big|_{\theta^0}$, ... (4.15)

and this shows immediately that the functions (4.14) are functions of $T$.

This completes the proof of the assertion of Theorem 4.3 for $T^*$.

To prove the assertion for $T^+$, we have merely to note that on $K_{si}$ each component of $T^+$ is either a linear function of the corresponding component of $T_{si}$ or a constant. From this, by virtue of the result already established for $T^*$, it follows that $T^+$ is a function of any particular sufficient statistic, in each $K_{si}$. This proves the asserted result for $T^+$.

We have therefore completed the proof of Theorem 4.3.

The question remains open as to when we can assert the existence of a sufficient statistic which is continuously differentiable in an open set $A \subset R$, with $R - A$ of Lebesgue measure 0, and which is functionally minimal in $A$.


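5. TWO ILLUSTRATIVE EXAMPLES

In this section we shall apply the preceding results in two examples. The first example pertains to the well-known Behrens-Fisher problem; here we show that $\Omega - R$ has Lebesgue measure 0 and that $\rho(x) = 4$ for $x \in R$. The second example exhibits a problem where again $\Omega - R$ has Lebesgue measure 0 but $\rho(x)$ is not constant for $x \in R$.

Consider $m+n$, ($m, n > 2$), independent random variables $X_1, \ldots, X_m, Y_1, \ldots, Y_n$ where $X_i$ is a normal random variable with mean $\theta_1$ and variance $\theta_2^2$ for $i = 1, \ldots, m$ and $Y_j$ is normal with mean $\theta_3$ and variance $\theta_4^2$ for $j = 1, \ldots, n$. Let $\mathcal{M} = \{\mu_\theta : \theta \in \Theta\}$ be the family of measures generated by the joint density of $X_1, \ldots, Y_n$:

$\theta = (\theta_1, \theta_2, \theta_3, \theta_4)$,
$\Theta = (-\infty, \infty) \times (0, \infty) \times (-\infty, \infty) \times (0, \infty)$,
$\Omega = E^{m+n}$, ... (5.1)
$\mu_\theta(A) = \int_A p(x_1, \ldots, y_n; \theta)\, dx_1 \cdots dy_n$, $A$ a Lebesgue set,

where $p(x_1, \ldots, y_n; \theta)$ denotes the joint density of $X_1, \ldots, Y_n$. It is clear that conditions (1.1) and (1.2) are met and we proceed immediately to the calculation of $\rho_1(x)$ and $\rho(x)$. The calculation of the various mixed partial derivatives is routine but we list the results for the sake of completeness.

$\dfrac{\partial^2 \log p}{\partial x_i\,\partial\theta_1} = \dfrac{1}{\theta_2^2}$, $\dfrac{\partial^2 \log p}{\partial x_i\,\partial\theta_2} = \dfrac{2(x_i - \theta_1)}{\theta_2^3}$, $\dfrac{\partial^2 \log p}{\partial x_i\,\partial\theta_3} = \dfrac{\partial^2 \log p}{\partial x_i\,\partial\theta_4} = 0$, $i = 1, \ldots, m$;
$\dfrac{\partial^2 \log p}{\partial y_j\,\partial\theta_3} = \dfrac{1}{\theta_4^2}$, $\dfrac{\partial^2 \log p}{\partial y_j\,\partial\theta_4} = \dfrac{2(y_j - \theta_3)}{\theta_4^3}$, $\dfrac{\partial^2 \log p}{\partial y_j\,\partial\theta_1} = \dfrac{\partial^2 \log p}{\partial y_j\,\partial\theta_2} = 0$, $j = 1, \ldots, n$. ... (5.2)

Thus it is clear that

$\rho_1(x) \le 4$ and $\rho(x) \le 4$, $x \in \Omega$. ... (5.3)

Let $j_1 = 1$, $j_2 = 2$, $j_3 = 3$, $j_4 = 4$; then

$\Lambda(x; 1, 2, 3, 4; \theta^{(1)}, \ldots, \theta^{(4)}) = \begin{pmatrix} \dfrac{1}{(\theta_2^{(1)})^2} & \dfrac{2(x_1 - \theta_1^{(2)})}{(\theta_2^{(2)})^3} & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ \dfrac{1}{(\theta_2^{(1)})^2} & \dfrac{2(x_m - \theta_1^{(2)})}{(\theta_2^{(2)})^3} & 0 & 0 \\ 0 & 0 & \dfrac{1}{(\theta_4^{(3)})^2} & \dfrac{2(y_1 - \theta_3^{(4)})}{(\theta_4^{(4)})^3} \\ \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & \dfrac{1}{(\theta_4^{(3)})^2} & \dfrac{2(y_n - \theta_3^{(4)})}{(\theta_4^{(4)})^3} \end{pmatrix}$. ... (5.4)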
Hence, defining the set $N$ by

$N = \{x \mid x_1 = x_2 = \cdots = x_m \text{ or } y_1 = y_2 = \cdots = y_n\}$, ... (5.5)

we see that for $x \in \Omega - N$ we have

rank $\Lambda(x; 1, 2, 3, 4; \theta^{(1)}, \ldots, \theta^{(4)}) = 4$ ... (5.6)

and $\rho(x) = \rho_1(x) = 4$, ... (5.7)

while for $x \in N$, $\rho(x) > \rho_1(x)$. ... (5.8)

Next we note that $\mu_\theta(N) = 0$ ... (5.9)

and since $\mu_\theta$ is absolutely continuous with respect to Lebesgue measure and conversely, it follows that the Lebesgue measure of $N$ is also 0. From (5.7) and (5.8) we have that

$\Omega - R = N$ ... (5.10)

and hence we have shown that $R$ is an everywhere dense set whose complement has Lebesgue measure 0. Finally we define the function

$T(x) = \left( \sum_{i=1}^{m} x_i,\ \sum_{i=1}^{m} x_i^2,\ \sum_{j=1}^{n} y_j,\ \sum_{j=1}^{n} y_j^2 \right)$ ... (5.11)

and note that with $j_1 = 1$, $j_2 = 2$, $j_3 = 3$, $j_4 = 4$ and $\theta^{(1)} = \theta^{(2)} = \theta^{(3)} = \theta^{(4)} = (0, 1, 0, 1)$, for $x \in R$ this is, up to additive constants, exactly the function defined in the proof of Theorem 3.3; thus $T(x)$ is a sufficient statistic for the family $\mathcal{M}$ defined by (5.1) which is regular everywhere on $R$ and of minimum dimension everywhere on $R$.
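That the joint density depends on the observations only through the four sums in (5.11) can be checked numerically. The sketch below is illustrative only: it assumes that the second and fourth parameters are standard deviations, and the function names `log_density`, `T` and `log_density_from_T` are hypothetical helpers rather than notation from the paper.

```python
import math

def log_density(xs, ys, theta):
    """Joint log density of the two normal samples (Behrens-Fisher setting).

    theta = (mean_x, sd_x, mean_y, sd_y); the sd parametrization is assumed.
    """
    t1, t2, t3, t4 = theta
    lx = sum(-math.log(t2 * math.sqrt(2 * math.pi)) - (x - t1) ** 2 / (2 * t2 ** 2) for x in xs)
    ly = sum(-math.log(t4 * math.sqrt(2 * math.pi)) - (y - t3) ** 2 / (2 * t4 ** 2) for y in ys)
    return lx + ly

def T(xs, ys):
    """The four-dimensional statistic of (5.11)."""
    return (sum(xs), sum(x * x for x in xs), sum(ys), sum(y * y for y in ys))

def log_density_from_T(t, m, n, theta):
    """The same log density, computed from T and the sample sizes alone."""
    sx, sxx, sy, syy = t
    t1, t2, t3, t4 = theta
    lx = -m * math.log(t2 * math.sqrt(2 * math.pi)) - (sxx - 2 * t1 * sx + m * t1 ** 2) / (2 * t2 ** 2)
    ly = -n * math.log(t4 * math.sqrt(2 * math.pi)) - (syy - 2 * t3 * sy + n * t3 ** 2) / (2 * t4 ** 2)
    return lx + ly

xs, ys = [0.5, -1.0, 2.0], [1.5, 0.25]
theta = (0.3, 2.0, -1.0, 0.7)
print(log_density(xs, ys, theta), log_density_from_T(T(xs, ys), len(xs), len(ys), theta))
```

The two printed values agree up to rounding, which is the factorization $p(x, \theta) = f(T(x), \theta)\,g(x)$ with $g \equiv 1$ in this family.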


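For the second example we consider two independent random variables $X_1$ and $X_2$ with density functions $q(\xi, \theta_1)$ and $q(\xi, \theta_2)$ where $\theta_i \in (0, \infty)$, ($i = 1, 2$) and $q(\xi, \theta_i)$ is defined as follows:

$q(\xi, \theta_i) = \begin{cases} \dfrac{1}{1 + \theta_i\sqrt{2\pi}}\, e^{-\xi^2/2\theta_i^2}, & \xi \le 0, \\ \dfrac{1}{1 + \theta_i\sqrt{2\pi}}, & 0 < \xi < 1, \\ \dfrac{1}{1 + \theta_i\sqrt{2\pi}}\, e^{-(\xi-1)^2/2\theta_i^2}, & \xi \ge 1, \end{cases}$ ... (5.12)

$i = 1, 2$. The family of measures $\mathcal{M} = \{\mu_\theta : \theta \in \Theta\}$ considered is that generated by the joint density of $X_1$ and $X_2$, i.e.,

$\theta = (\theta_1, \theta_2)$,
$\Theta = (0, \infty) \times (0, \infty)$,
$\mu_\theta(A) = \int_A p(x_1, x_2; \theta)\, dx_1\, dx_2$, $A$ a Lebesgue set, ... (5.13)

where $p(x_1, x_2; \theta)$ is equal to $q(x_1, \theta_1)\, q(x_2, \theta_2)$. Clearly $p(x; \theta)$ is positive and continuous at all points of $\Omega \times \Theta$ and an easy calculation verifies the existence and continuity of all the necessary partial derivatives.

Now we define the sets

$A_1 = \{x \mid x_1, x_2 < 0\}$
$A_2 = \{x \mid x_1 < 0,\ x_2 > 1\}$
$A_3 = \{x \mid x_1 > 1,\ x_2 < 0\}$
$A_4 = \{x \mid x_1, x_2 > 1\}$
$A_5 = \{x \mid x_1 < 0,\ 0 \le x_2 \le 1\}$ ... (5.14)
$A_6 = \{x \mid 0 \le x_1 \le 1,\ x_2 < 0\}$
$A_7 = \{x \mid 0 \le x_1 \le 1,\ x_2 > 1\}$
$A_8 = \{x \mid x_1 > 1,\ 0 \le x_2 \le 1\}$
$A_9 = \{x \mid 0 \le x_1, x_2 \le 1\}$

Clearly $\Omega = \bigcup_{i=1}^{9} A_i$ and minimum dimension will be computed by considering the problem in individual $A_i$'s. Let $x \in A_1$; then

$p(x; \theta) = \dfrac{1}{(1 + \theta_1\sqrt{2\pi})(1 + \theta_2\sqrt{2\pi})}\, e^{-x_1^2/2\theta_1^2 - x_2^2/2\theta_2^2}$ ... (5.15)

and

$\dfrac{\partial^2 \log p}{\partial x_i\,\partial\theta_j} = \begin{cases} \dfrac{2x_i}{\theta_i^3}, & i = j, \\ 0, & i \ne j, \end{cases}$ $i, j = 1, 2$. ... (5.16)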
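Thus

rank $\Lambda(x; j_1, j_2; \theta^{(1)}, \theta^{(2)}) \le 2$ for $x \in A_1$. ... (5.17)

Choose $j_1 = 1$ and $j_2 = 2$, then

$\Lambda(x; 1, 2; \theta^{(1)}, \theta^{(2)}) = \begin{pmatrix} \dfrac{2x_1}{(\theta_1^{(1)})^3} & 0 \\ 0 & \dfrac{2x_2}{(\theta_2^{(2)})^3} \end{pmatrix}$ ... (5.18)

and hence $\rho(x) = \rho_1(x) = 2$, $x \in A_1$. ... (5.19)

Similar reasoning shows that

$\rho(x) = \rho_1(x) = 2$, $x \in A_2 \cup A_3 \cup A_4$, ... (5.20)

and we therefore obtain $\bigcup_{i=1}^{4} A_i \subset R$. ... (5.21)

Now consider $x \in$ interior $A_5$, then

$p(x; \theta) = \dfrac{1}{(1 + \theta_1\sqrt{2\pi})(1 + \theta_2\sqrt{2\pi})}\, e^{-x_1^2/2\theta_1^2}$, ... (5.22)

$\dfrac{\partial^2 \log p}{\partial x_1\,\partial\theta_1} = \dfrac{2x_1}{\theta_1^3}$

and ... (5.23)

$\dfrac{\partial^2 \log p}{\partial x_2\,\partial\theta_j} = \dfrac{\partial^2 \log p}{\partial x_i\,\partial\theta_2} = 0$, $i, j = 1, 2$.

In this case rank $\Lambda(x; j_1, j_2; \theta^{(1)}, \theta^{(2)}) \le 1$ ... (5.24)

and $\Lambda(x; 1, 2; \theta^{(1)}, \theta^{(2)}) = \begin{pmatrix} \dfrac{2x_1}{(\theta_1^{(1)})^3} & 0 \\ 0 & 0 \end{pmatrix}$. ... (5.25)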
Therefore, we obtain

$\rho(x) = \rho_1(x) = 1$, $x \in$ interior $A_5$, ... (5.26)

and it is easily seen that in fact (5.26) holds on interior $A_i$ for $i = 6, 7, 8$ as well. If $x \in A_9$, then $p(x; \theta)$ reduces to a constant and hence all the mixed partial derivatives vanish; thus, for $x \in$ interior $A_9$ we have that

$\rho(x) = \rho_1(x) = 0$. ... (5.27)

We may thus conclude that

$\Omega - \bigcup_{i=5}^{8} \text{bndry. } A_i \subset R$.

On the other hand, an immediate calculation gives us that

$\rho_1(x) < \rho(x)$, $x \in \bigcup_{i=5}^{8} \text{bndry. } A_i$, ... (5.28)

and therefore $R = \Omega - \bigcup_{i=5}^{8} \text{bndry. } A_i$. ... (5.29)

That the Lebesgue measure of $\Omega - R$ is 0 follows from the fact that $\Omega - R$ is a linear set and thus has 0 planar measure.


Finally we define 


| i 
т(к)= 4 2 x c interior As Û interior Ав „. (5.30) 
az ж e interior A, U interior A, 


а c interior As 


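This defines $T(x)$ a.e. (Lebesgue) on $\Omega$ and we note that for $x \in R$, with the proper choices of $j_1$, $j_2$, $\theta^{(1)}$, $\theta^{(2)}$, $T(x)$ defined by (5.30) is the statistic constructed, up to additive constants, in the proof of Theorem 3.3. Thus $T(x)$ is a sufficient statistic for the family $\mathcal{M}$ defined by (5.13) which is of minimum dimension everywhere on $R$.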
THE FAMILY OF ANCILLARY STATISTICS


By D. BASU
Indian Statistical Institute, Calcutta


SUMMARY. Though the marginal distributions of the ancillary statistics are independent of the parameter they are not useless or informationless. A set of ancillaries may sometimes summarise the whole of the information contained in the sample. A classification of the ancillaries in terms of the partial order of their information content is attempted here. In general there are many maximal ancillaries. Among the minimal ancillaries there exists a unique largest one. When there exists a complete sufficient statistic, the problem of tracking down the maximal and minimal ancillaries becomes greatly simplified.


1. INTRODUCTION 


An ancillary! statistic is one whose distribution is the same for all possible 
values of the unknown parameter. A statistic that is not ancillary may be called 
‘informative’. The classical example of an ancillary statistic is the following : 
Example (a): Let X and Y be two positive valued random variables with 
the joint density function 


$f(x, y) = e^{-\theta x - y/\theta}$, $x > 0$, $y > 0$, $\theta > 0$.

Here $F = XY$ is an ancillary statistic. The maximum likelihood estimator $T = \sqrt{Y/X}$ of $\theta$ is not a sufficient statistic. However, the pair $(F, T)$ is jointly sufficient.
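The ancillarity of $F = XY$ in Example (a) can be seen from a simple representation, sketched here under stated assumptions: we write $X = E_1/\theta$ and $Y = \theta E_2$ with $E_1, E_2$ independent standard exponential variables, which gives the joint density $e^{-\theta x - y/\theta}$. The function name `sample_xy` and the use of a seeded generator are illustrative choices only.

```python
import math
import random

def sample_xy(theta, rng):
    """Draw (X, Y) with joint density exp(-theta*x - y/theta), x, y > 0.

    X = E1/theta is exponential with rate theta and Y = theta*E2 is
    exponential with mean theta, where E1, E2 are standard exponentials.
    """
    e1 = rng.expovariate(1.0)
    e2 = rng.expovariate(1.0)
    return e1 / theta, e2 * theta

# Feed the same underlying exponentials to two different parameter values.
x_a, y_a = sample_xy(0.5, random.Random(7))
x_b, y_b = sample_xy(4.0, random.Random(7))

# F = XY equals E1*E2 whatever theta is, so its distribution is free of theta,
# while T = sqrt(Y/X) = theta * sqrt(E2/E1) scales with theta.
print(x_a * y_a, x_b * y_b)
print(math.sqrt(y_a / x_a) / 0.5, math.sqrt(y_b / x_b) / 4.0)
```

Each printed pair agrees up to rounding: the product does not move with $\theta$, and $T/\theta$ does not either, which is why $T$ carries the information and $F$, by itself, carries none.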

The above example shows that though an ancillary statistic, by itself, fails to provide any information about the parameter, yet in conjunction with another statistic (which, as we shall presently see, need not be informative) may supply valuable information² about the parameter. In the following example we have given a family of ancillary statistics that are jointly equivalent to the whole sample.

Example (b): Let $X$ and $Y$ be independent normal variables with unknown mean $\theta$ and unit standard deviations. Here $X - Y$ is an ancillary statistic. It is commonly believed that every ancillary statistic (in this situation) is necessarily a function of $X - Y$. That, however, is not true. Let

$F_c = \begin{cases} X - Y & \text{if } X + Y < c, \\ Y - X & \text{if } X + Y \ge c, \end{cases}$

where $c$ is a fixed constant.

¹ The name 'ancillary' is due to Fisher (1925). The name 'distribution-free' is also in use and perhaps would have been more appropriate in the present context.

² See Fisher (1956) for a discussion of how the ancillary information may (according to Fisher) be recovered.
Since $X - Y$ and $Y - X$ are identically distributed and each is independent of $X + Y$ it at once follows that $F_c$ is independent of $X + Y$ and has the same distribution as that of $X - Y$. Thus, $F_c$ is ancillary for each $c$. Consider now the family $\{F_c\}$, $-\infty < c < \infty$, of ancillary statistics. For fixed $X$ and $Y$ the different values of $F_c$ (for varying $c$) are either $X - Y$ or $Y - X$. The value $c_0$ of $c$ where $F_c$ changes sign ($F_c$ does not change sign only if $X - Y = 0$ and that is a null event) is the value of $X + Y$. Thus, given $F_c(X, Y)$ for all $c$ we can find $X + Y$ and $X - Y$. Hence, the family $\{F_c\}$ of ancillary statistics is equivalent to the whole sample $(X, Y)$. The countable family $\{F_c\}$ where $c$ runs through the set of rational numbers is easily seen to be also equivalent to $(X, Y)$.
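The recovery of the sample from the family $\{F_c\}$ can be sketched as follows. This is only an illustration: the finite grid of values of $c$ stands in for the rationals, so $X + Y$ is recovered only to within the grid spacing, and the names `F` and `recover` are hypothetical.

```python
def F(c, x, y):
    """The ancillary statistic F_c of Example (b)."""
    return x - y if x + y < c else y - x

def recover(fc, c_grid):
    """Recover (approximate X+Y, exact X-Y) from the values of F_c on a grid.

    fc: function c -> F_c(X, Y); c_grid: increasing values of c whose range
    is assumed to contain X+Y strictly below its maximum, with X != Y.
    """
    values = [fc(c) for c in c_grid]
    diff = values[-1]                       # for c > X+Y, F_c = X - Y
    # F_c = Y - X exactly when c <= X+Y, so the sign change locates X+Y.
    total = max(c for c, v in zip(c_grid, values) if v != diff)
    return total, diff

x, y = 1.25, 0.5
grid = [k / 100 for k in range(-1000, 1001)]
total, diff = recover(lambda c: F(c, x, y), grid)
print(total, diff)
print((total + diff) / 2, (total - diff) / 2)   # approximately x and y
```

Refining the grid narrows the bracket around the sign change, which is the sense in which the countable family over rational $c$ already determines $(X, Y)$.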


The author (Basu; 1955, 1958) has shown that, under very mild restrictions, any statistic independent of a sufficient statistic is ancillary and that the converse proposition is also true, provided the sufficient statistic is complete.

In Example (b) the statistic $T = X + Y$ is a complete sufficient statistic. A statistic $F$ can, therefore, be ancillary if and only if $F$ is independent of $T$. The following is a general method for constructing statistics independent of $T$. Start with any ancillary statistic $F$. In general, there will be many measure-preserving transformations of $F$ (i.e. a mapping $\varphi$ of the range space of $F$ into itself such that $\varphi(F)$ and $F$ are identically distributed). For each real $t$, define a measure-preserving transformation $\varphi_t$ of $F$. Then, take the statistic $\varphi_T(F)$. Subject to some measurability restrictions, $\varphi_T(F)$ will be independent of $T$ and hence will be ancillary. In Example (b) we took $F = X - Y$ and $\varphi_t(F) = F$ or $-F$ according as $t < c$ or $\ge c$.

If a statistic $F$ is ancillary then every (measurable) function of $F$ is also ancillary. The statistic $F_1$ is said to include (or be more informative than) the statistic $F_2$ if $F_2$ can be expressed as a function of $F_1$. In this case we write $F_1 \supset F_2$ or $F_2 \subset F_1$. Two statistics are said to be equivalent if each can be expressed as a function of the other.


Example (c): Let $X_1, X_2, \ldots, X_n$ be $n$ independent observations on a normal variable with mean $\theta$ and s.d. unity. Then each of the $n-1$ statistics

$F_1 = X_1 - X_2$, $F_2 = (X_1 - X_2, X_1 - X_3)$, ..., $F_{n-1} = (X_1 - X_2, X_1 - X_3, \ldots, X_1 - X_n)$

is ancillary and

$F_1 \subset F_2 \subset \cdots \subset F_{n-1}$.

The two ancillary statistics $F_{n-1}$ and $F'_{n-1} = (X_2 - X_1, X_2 - X_3, \ldots, X_2 - X_n)$ are easily seen to be equivalent.
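Both claims of Example (c) can be checked mechanically. The sketch below is illustrative only; in particular, the second statistic is assumed here to be the vector of differences from $X_2$, and the helper names are hypothetical.

```python
def f_last(xs):
    """F_{n-1} = (X_1 - X_2, X_1 - X_3, ..., X_1 - X_n)."""
    return [xs[0] - xk for xk in xs[1:]]

def f_alt(xs):
    """Assumed second statistic: (X_2 - X_1, X_2 - X_3, ..., X_2 - X_n)."""
    return [xs[1] - xk for k, xk in enumerate(xs) if k != 1]

def alt_from_last(d):
    """Express f_alt as a function of f_last.

    X_2 - X_1 = -(X_1 - X_2) and X_2 - X_k = (X_1 - X_k) - (X_1 - X_2).
    """
    return [-d[0]] + [dk - d[0] for dk in d[1:]]

def last_from_alt(e):
    """Express f_last as a function of f_alt, by the symmetric identities."""
    return [-e[0]] + [ek - e[0] for ek in e[1:]]

xs = [5, 2, 9, -1]
shifted = [x + 10 for x in xs]          # a change of the location parameter
print(f_last(xs), f_last(shifted))      # identical: differences ignore the shift
print(alt_from_last(f_last(xs)), f_alt(xs))
print(last_from_alt(f_alt(xs)), f_last(xs))
```

Invariance under a common shift is the reason the differences are ancillary for the mean, and the two conversion functions exhibit each statistic as a function of the other, which is the definition of equivalence used in the text.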


From Example (Б) it is obvious that F,_, does not include all ancillary statistics 


An ancillary statistic $M$ is said to be 'maximal' if there exists no non-equivalent
ancillary $M^*$ such that $M \subset M^*$. Thus, given any ancillary $F$, either it is maximal
or there exists an ancillary $F^* \supset F$. Given any ancillary $F_0$, there exists (Theorem 2)
a maximal ancillary $M \supset F_0$. In general there exist many non-equivalent maximal
ancillaries. A typical property (Cor. to Theorem 4) of a maximal ancillary $M$ is that
for any ancillary $F$ not included in $M$, the pair $(M, F)$ is informative.


A minimal ancillary is one that is included in every maximal ancillary. Among
the class of minimal ancillaries there exists (Theorem 5) a unique largest one $G_0$.
In the absence of a better name we prefer to call $G_0$ the laminal ancillary. $G_0$ includes
every minimal ancillary and is included in every maximal ancillary. A typical
property (Theorem 6) of a minimal ancillary $G$ is that, for any ancillary $F$, the pair
$(G, F)$ is ancillary.
If there exists a complete sufficient statistic $G$, then, any ancillary statistic
$F$, such that the pair $(G, F)$ is essentially equivalent to the whole sample, is shown
(Theorem 7) to be essentially maximal. Under some further restrictions, the laminal
ancillary is shown (Theorem 8) to be essentially equivalent to a constant.


In the following sections we elaborate on the above sketch of the family-tree
of ancillary statistics. For the sake of elegance and brevity of exposition we use the
language of sub $\sigma$-fields. Reference may be made to Bahadur (1954, 1955) for excellent
expositions of the sub $\sigma$-field approach.


2. DEFINITIONS


Let $(\mathcal{X}, \mathcal{S})$ be an arbitrary measurable space and let $\{P_\theta\}$, $\theta \in \Omega$ be a family
of probability measures on $\mathcal{S}$. Any statistic $T$ induces a sub $\sigma$-field $\mathcal{S}_T \subset \mathcal{S}$. Instead
of dealing with statistics it is more convenient (in the present context) to deal
with the corresponding sub $\sigma$-fields.


Definition 1: The event $A \in \mathcal{S}$ is said to be ancillary if $P_\theta(A)$ is the same
for all $\theta \in \Omega$. The family of all ancillary events is denoted by $\mathcal{A}$.


It is easy to check that the family $\mathcal{A}$ is closed for complementation and
countable disjoint unions. However, in general $\mathcal{A}$ is not closed for intersection (i.e.
$\mathcal{A}$ is not a $\sigma$-field).

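In order to show that the family $\mathcal{A}$ in Example (b) does not constitute a $\sigma$-field,
we have only to check that

$$P_\theta(X - Y > 0 \text{ and } F_c(X, Y) > 0) = P_\theta(X - Y > 0 \text{ and } X + Y < c)
= \frac{1}{2} \int_{-\infty}^{c} \frac{1}{2\sqrt{\pi}}\, e^{-\frac{1}{4}(x - 2\theta)^2}\, dx,$$

which varies with $\theta$.

In Example (b) the Borel-extension of $\mathcal{A}$ is $\mathcal{S}$.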
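As a numerical illustration of the dependence on $\theta$, the probability of the intersection can be evaluated in closed form. The sketch assumes, as before, that Example (b) has $X$, $Y$ independent $N(\theta, 1)$, so that $X - Y$ and $X + Y$ are independent with $X + Y \sim N(2\theta, 2)$; the helper names are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def joint_probability(theta, c):
    """P(X - Y > 0 and X + Y < c) for independent X, Y ~ N(theta, 1)."""
    # P(X - Y > 0) = 1/2, and X + Y ~ N(2*theta, 2) is independent of X - Y.
    return 0.5 * normal_cdf((c - 2.0 * theta) / math.sqrt(2.0))


for theta in (-1.0, 0.0, 1.0):
    print(theta, joint_probability(theta, c=0.0))
```

The two events being intersected each have probability 1/2 for every $\theta$, while the probability of their intersection falls as $\theta$ grows.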
Example (d): Let $\mathcal{X}$ consist of the three points $a$, $b$ and $c$ and let the corresponding
probability measures be $\tfrac14 - \theta$, $\tfrac12$, and $\tfrac14 + \theta$ respectively, where $0 < \theta < \tfrac14$.
Here $\mathcal{A}$ consists of the four sets $\emptyset$, $[b]$, $[a, c]$ and $\mathcal{X}$ and so $\mathcal{A}$ is a sub $\sigma$-field of $\mathcal{S}$.


Definition 2: A $\sigma$-field $\mathcal{F}$ is said to be ancillary if $\mathcal{F} \subset \mathcal{A}$. A $\sigma$-field that
is not ancillary is called informative.


A statistic is ancillary or informative according as the corresponding $\sigma$-field
is so.


Definition 3: Two ancillary sets $A$ and $B$ are said to conform if $AB$ is also
ancillary. If $A$ conforms to $B$ then we write $A \sim B$. Since
$P_\theta(AB) + P_\theta(AB') = P_\theta(A)$ it follows that $A \sim B$ if and only if $A \sim B'$.


If $A$ conforms to every one of a sequence of disjoint sets $B_1, B_2, \ldots$ then it is
easy to check that $A \sim \bigcup B_i$.


Definition 4: Let $\Gamma_0$ be the family of all ancillary sets $B$ such that $B \sim A$
for all $A \in \mathcal{A}$.


Clearly $\emptyset$ and $\mathcal{X}$ belong to $\Gamma_0$. From what we have said before it follows
that $\Gamma_0$ is closed for complementation and countable disjoint unions.


Theorem 1: The family $\Gamma_0$ is a $\sigma$-field.


Proof: It is enough to show that $\Gamma_0$ is closed for intersection. Let $B_1$ and
$B_2$ both belong to $\Gamma_0$ and let $A \in \mathcal{A}$. From $B_1 \in \Gamma_0$ it follows that $B_1 A \in \mathcal{A}$.
From $B_2 \in \Gamma_0$ it then follows that $B_2 B_1 A \in \mathcal{A}$. Since $A$ is an arbitrary ancillary set, it
follows that $B_1 B_2 \in \Gamma_0$.


We shall later on see that the ancillary $\sigma$-field $\Gamma_0$ corresponds to the laminal
ancillary $G_0$ that we have referred to in §1.


The family $\mathcal{A}$ of ancillary sets is a $\sigma$-field if and only if every pair of ancillary
sets conform to one another, i.e. if $\mathcal{A} = \Gamma_0$.


Example (e): Let $\mathcal{X}$ consist of the five points $a$, $b$, $c$, $d$ and $e$ with the
corresponding probabilities $\tfrac16$, $\theta$, $\theta$, $\tfrac13 - \theta$ and $\tfrac12 - \theta$ respectively, where $0 < \theta < \tfrac13$. In
this case $\Gamma_0$ consists of the four sets $\emptyset$, $[a]$, $[b, c, d, e]$, and $\mathcal{X}$. The two sets $[b, d]$
and $[b, e]$ are both ancillary but they do not conform. Here $\mathcal{A}$ is wider than $\Gamma_0$ and
is not a $\sigma$-field.
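For a finite sample space the families $\mathcal{A}$ and $\Gamma_0$ can be found by enumeration. The sketch below is illustrative only: the point probabilities $\tfrac16, \theta, \theta, \tfrac13 - \theta, \tfrac12 - \theta$ are an assumed reading of Example (e) (the printed fractions are not fully legible in the source), and all names are hypothetical. It lists every event whose probability is free of $\theta$ and then keeps those that conform to all of them.

```python
from fractions import Fraction
from itertools import combinations

POINTS = "abcde"


def probabilities(theta):
    """Point probabilities under the assumed reading of Example (e)."""
    return {
        "a": Fraction(1, 6),
        "b": theta,
        "c": theta,
        "d": Fraction(1, 3) - theta,
        "e": Fraction(1, 2) - theta,
    }


# Each probability is linear in theta, so agreement at two distinct values
# of theta already implies agreement for every theta; a third is a safeguard.
THETAS = [Fraction(1, 12), Fraction(1, 6), Fraction(1, 4)]


def is_ancillary(event):
    """True when P_theta(event) takes a single value over THETAS."""
    values = {sum(probabilities(t)[x] for x in event) for t in THETAS}
    return len(values) == 1


def all_events():
    """Yield every subset of the sample space as a frozenset."""
    for size in range(len(POINTS) + 1):
        for combo in combinations(POINTS, size):
            yield frozenset(combo)


ANCILLARY = [e for e in all_events() if is_ancillary(e)]


def conform(a, b):
    """Two ancillary sets conform when their intersection is ancillary."""
    return is_ancillary(a & b)


# Gamma_0: ancillary sets that conform to every ancillary set.
GAMMA_0 = [b for b in ANCILLARY if all(conform(b, a) for a in ANCILLARY)]


def label(event):
    return "[" + ", ".join(sorted(event)) + "]"


print(len(ANCILLARY), "ancillary sets")
print("Gamma_0:", [label(e) for e in GAMMA_0])
```

Exact rational arithmetic is used so that equality of probabilities is tested without rounding error.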


Definition 5: The ancillary $\sigma$-field $\mathcal{F}_1$ is said to include the ancillary $\sigma$-field
$\mathcal{F}_2$ (in symbols $\mathcal{F}_1 \supset \mathcal{F}_2$ or $\mathcal{F}_2 \subset \mathcal{F}_1$) if every element of $\mathcal{F}_2$ is an element of $\mathcal{F}_1$.


The above partial order on ancillary $\sigma$-fields corresponds to the inclusion
relationship for ancillary statistics.


Definition 6: The ancillary $\sigma$-field $\mathcal{M}$ is said to be maximal if there exists
no other ancillary $\sigma$-field $\mathcal{M}^*$ such that $\mathcal{M}^* \supset \mathcal{M}$.


Definition 7: The intersection of all the maximal ancillary $\sigma$-fields is called
the laminal ancillary.

The laminal ancillary is the largest ancillary that is included in all maximal
ancillaries.


3. EXISTENCE AND CHARACTERIZATIONS OF MAXIMAL AND 
LAMINAL ANCILLARIES 


The following theorem is fundamental.


Theorem 2: Given any ancillary $\sigma$-field $\mathcal{F}_0$ there exists a maximal ancillary
$\sigma$-field $\mathcal{M} \supset \mathcal{F}_0$.


Proof: We first prove that given any family $\{\mathcal{F}_j\}$, $j \in J$ of ancillary $\sigma$-fields
that are linearly ordered (by the inclusion relationship), the Borel-extension $\mathcal{F}$ of $\bigcup \mathcal{F}_j$
is also ancillary.

Clearly, $\bigcup \mathcal{F}_j$ contains $\emptyset$ and $\mathcal{X}$ and is closed for complementation. Since
$\{\mathcal{F}_j\}$ is linearly ordered it follows that $\bigcup \mathcal{F}_j$ is also closed for finite unions. That is,
$\bigcup \mathcal{F}_j$ is a field of sets.

Since each $\mathcal{F}_j$ is ancillary, the restriction of $P_\theta$ to $\bigcup \mathcal{F}_j$ is a measure $Q$ that
does not depend on $\theta$. From the fundamental Extension Theorem of measures
(Kolmogorov, 1933) we know that the extension of $Q$ to $\mathcal{F}$ is unique.

It follows at once that the restriction of $P_\theta$ to $\mathcal{F}$ is the same for all $\theta$, i.e. $\mathcal{F}$
is an ancillary $\sigma$-field.

Now let $\mathcal{C}$ be the family of all ancillary $\sigma$-fields that include $\mathcal{F}_0$. Since corresponding
to any linearly ordered sub-family of $\mathcal{C}$ there exists an ancillary $\sigma$-field that
includes every member of the sub-family it follows from Zorn's Lemma that $\mathcal{C}$ has
a maximal element.


Let $\{\mathcal{M}_i\}$, $i \in I$ be the family of all maximal ancillary $\sigma$-fields. We at once
have the

Theorem 3: $\mathcal{A} = \bigcup \mathcal{M}_i$.

Proof: We have only to note that corresponding to any element $A$ of $\mathcal{A}$ there
exists an ancillary $\sigma$-field that contains $A$ as an element and then apply Theorem 2.

Corollary: If $\{\mathcal{M}_i\}$ consists of only one $\sigma$-field $\mathcal{M}_0$ then $\mathcal{A} = \mathcal{M}_0 = \Gamma_0$.

Thus, in any situation where there are non-conforming ancillary sets, the
family $\{\mathcal{M}_i\}$ has at least two members.

In Example (d) there is a unique maximal ancillary. In Example (e) there
are two maximal ancillaries namely:
$\mathcal{M}_1$ = the $\sigma$-field spanned by $[a]$ and $[b, d]$
$\mathcal{M}_2$ = the $\sigma$-field spanned by $[a]$ and $[b, e]$.

Theorem 4: If the ancillary set $A$ does not belong to the maximal ancillary
$\mathcal{M}$ then $A$ does not conform to at least one element of $\mathcal{M}$.

Proof: Suppose on the contrary that $A$ conforms to every element of $\mathcal{M}$.
Consider the family $\mathcal{M}^*$ of sets $AX \cup A'Y$ where $X$ and $Y$ are arbitrary elements
of $\mathcal{M}$. Clearly $\mathcal{M} \subset \mathcal{M}^*$ but not conversely.

Since

$$(AX \cup A'Y)' = AX' \cup A'Y',$$
$$\bigcup_i (AX_i \cup A'Y_i) = A\Big(\bigcup_i X_i\Big) \cup A'\Big(\bigcup_i Y_i\Big),$$
$$P_\theta(AX \cup A'Y) = P_\theta(AX) + P_\theta(A'Y),$$

it follows that $\mathcal{M}^*$ is also an ancillary $\sigma$-field.

This, however, contradicts the maximality of $\mathcal{M}$.

Corollary: If $\mathcal{M}$ be any maximal ancillary and if the ancillary $\sigma$-field $\mathcal{F}$ is not
included in $\mathcal{M}$ then the smallest $\sigma$-field containing both $\mathcal{M}$ and $\mathcal{F}$ is informative.

Theorem 5: $\bigcap \mathcal{M}_i = \Gamma_0$.

Proof: Since every element of $\Gamma_0$ conforms (by definition) to every ancillary
event, it follows from Theorem 4 that $\Gamma_0 \subset \mathcal{M}_i$ for all $i$, i.e. $\Gamma_0 \subset \bigcap \mathcal{M}_i$.

Now let $B \in \bigcap \mathcal{M}_i$ and $A$ be an arbitrary ancillary set. From Theorem 3 it
follows that $A \in \mathcal{M}_i$ for some $i$.

Hence $B$ and $A$ are together as elements of some $\mathcal{M}_i$ and so $B \sim A$.

Since $A$ is arbitrary it follows that $B \in \Gamma_0$.

Hence $\bigcap \mathcal{M}_i \subset \Gamma_0$ and so the equality is proved.


Theorem 6: For any ancillary $\sigma$-field $\mathcal{F}$ the smallest $\sigma$-field containing both
$\mathcal{F}$ and $\Gamma_0$ is also ancillary.


Proof: Consider the family of 'rectangular' sets $X \cap Y$ where $X \in \mathcal{F}$ and
$Y \in \Gamma_0$. From the definition of $\Gamma_0$ it follows that all such sets are ancillary and that
they conform to one another. The family of sets that may be formed by finite unions
of rectangular sets form a field of sets and each of them is ancillary. The rest follows
from the Extension Theorem of Measures.


4. WHEN A COMPLETE SUFFICIENT STATISTIC EXISTS 


In general there exist many maximal ancillaries. For instance, in Example
(b) there are uncountably many maximal ancillaries. In order to see this, let us
consider the family $\{A_c\}$ of ancillary events where $A_c = \{(X, Y) \mid F_c(X, Y) > 0\}$.
If $c < d$, then

$$P_\theta(A_c A_d) = P_\theta(X - Y > 0 \text{ and } X + Y < c) + P_\theta(Y - X > 0 \text{ and } X + Y \ge d)
= \tfrac12\,[\,1 - P_\theta(c \le X + Y < d)\,],$$

which varies with $\theta$.


Thus, the members of the family $\{A_c\}$ of ancillary sets are mutually non-conforming.
Hence the maximal ancillaries including the different members of the
family are all different.


Though there may exist many maximal ancillaries, it is not, in general, easy
to prove the maximality of a particular ancillary. However, in the situations where
we have a complete sufficient statistic, it is rather easy to demonstrate the maximality
of a large class of ancillaries.

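The following property of complete sufficient statistics is useful.¹ Here
we state and prove the result in terms of $\sigma$-fields.

Lemma (Basu, 1955): If $\mathcal{G} \subset \mathcal{S}$ be a boundedly complete sufficient $\sigma$-field
and $A$ any ancillary event, then $A$ is independent of $\mathcal{G}$.

Proof: Let $\phi = P(A \mid \mathcal{G})$ be the conditional probability of $A$ given $\mathcal{G}$. That
is, $\phi$ is a $\mathcal{G}$-measurable function such that

$$P_\theta(AG) = \int_G \phi \, dP_\theta \quad \text{for all } \theta \in \Omega \text{ and } G \in \mathcal{G}.$$

Since $\mathcal{G}$ is sufficient, it follows that $\phi$ may be chosen to be independent of $\theta$. Also
the set of $x$'s for which $\phi(x)$ lies outside the interval (0, 1) is of zero-measure for each
$\theta \in \Omega$.

Taking $G = \mathcal{X}$ we have

$$P_\theta(A) = \int_{\mathcal{X}} \phi \, dP_\theta \quad \text{for all } \theta \in \Omega.$$

Since $P_\theta(A)$ is independent of $\theta$ and $\phi$ is $\mathcal{G}$-measurable, it follows from the bounded
completeness of $\mathcal{G}$ that $\phi = P_\theta(A)$ almost surely for all $\theta \in \Omega$. Hence

$$P_\theta(AG) = \int_G P_\theta(A) \, dP_\theta = P_\theta(A)\, P_\theta(G) \quad \text{for all } \theta \in \Omega \text{ and } G \in \mathcal{G}.$$

That is, $A$ is independent of all $G \in \mathcal{G}$.

Before proceeding further we need a slightly wider definition of maximality for
an ancillary $\sigma$-field.


Definition 8: The two $\mathcal{S}$-measurable sets $A$ and $B$ are said to be essentially
equal if

$$P_\theta(A \,\Delta\, B) = P_\theta(AB' \cup A'B) = 0 \quad \text{for all } \theta \in \Omega.$$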


Definition 9: Two sub $\sigma$-fields $\mathcal{F}_1$ and $\mathcal{F}_2$ are said to be essentially equivalent
if corresponding to any set belonging to one of them there exists an essentially equal
set belonging to the other.


Definition 10: Any ancillary $\sigma$-field that is essentially equivalent to a
maximal ancillary is called essentially maximal.

Theorem 7: If $\mathcal{G}$ be a boundedly complete sufficient $\sigma$-field then any ancillary
$\mathcal{F}$ such that the Borel-extension of $\mathcal{F} \cup \mathcal{G}$ is essentially equivalent to $\mathcal{S}$, is essentially
maximal.
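¹ See Basu (1955) and Hogg and Craig (1956) for some other interesting applications.

Proof: Let $\mathcal{M}$ be a maximal ancillary including $\mathcal{F}$ and let $M$ be an arbitrary
element of $\mathcal{M}$. For proving the essential maximality of $\mathcal{F}$ we have to establish the
existence of an $F_0 \in \mathcal{F}$ such that $F_0$ is essentially equal to $M$.


Let $\mathcal{S}^*$ be the Borel extension of $\mathcal{F} \cup \mathcal{G}$. Since $\mathcal{S}^*$ is essentially equivalent
to $\mathcal{S}$, there exists an $M^* \in \mathcal{S}^*$ such that $M^*$ is essentially equal to $M$.


Since $M \in \mathcal{M} \supset \mathcal{F}$ and $M^*$ is essentially equal to $M$, it follows that $M^*$ is an
ancillary set conforming to every $F \in \mathcal{F}$. Clearly, the two measures $P$ and $Q$ on $\mathcal{F}$,
defined by the relations $P(F) = P_\theta(F)$ and $Q(F) = P_\theta(M^*F)$, are both independent
of $\theta$.


Therefore, the conditional probability function

$$\phi = P(M^* \mid \mathcal{F}) = \frac{dQ}{dP}$$

is independent of $\theta$.


Thus, $\phi$ is an $\mathcal{F}$-measurable function on $\mathcal{X}$ such that

$$P_\theta(M^*F) = \int_F \phi \, dP_\theta \quad \text{for all } \theta \in \Omega \text{ and } F \in \mathcal{F}.$$


Let $F$ and $G$ be typical elements of $\mathcal{F}$ and $\mathcal{G}$ respectively. Since $\mathcal{F}$ is ancillary
and $\mathcal{G}$ is boundedly complete sufficient, it follows (from the Lemma) that $\mathcal{F}$ and $\mathcal{G}$ are
independent.

$$\int_{FG} \phi \, dP_\theta = \int_{\mathcal{X}} (\phi \chi_F)\, \chi_G \, dP_\theta \qquad (\chi_F \text{ and } \chi_G \text{ are characteristic functions of } F \text{ and } G)$$
$$= \int_{\mathcal{X}} (\phi \chi_F) \, dP_\theta \int_{\mathcal{X}} \chi_G \, dP_\theta \qquad (\because\ \mathcal{F} \text{ and } \mathcal{G} \text{ are independent})$$
$$= P_\theta(M^*F)\, P_\theta(G). \qquad \ldots (\alpha)$$

Again, since $M^* \sim F$ it follows (from the Lemma) that $M^*F$ is independent of $\mathcal{G}$.

$$\therefore \int_{FG} \chi_{M^*} \, dP_\theta = P_\theta(M^*FG) = P_\theta(M^*F)\, P_\theta(G). \qquad \ldots (\beta)$$


From ($\alpha$) and ($\beta$) we have

$$\int_{FG} (\phi - \chi_{M^*}) \, dP_\theta = 0 \quad \text{for all } F \in \mathcal{F} \text{ and } G \in \mathcal{G}.$$

Since $\phi - \chi_{M^*}$ is $\mathcal{S}^*$-measurable it at once follows that

$$\int_B (\phi - \chi_{M^*}) \, dP_\theta = 0 \quad \text{for all } B \in \mathcal{S}^*.$$


Therefore, for each $\theta \in \Omega$, $\phi(x) - \chi_{M^*}(x) = 0$ for almost all $x$.

Let $F_0 = \{x \mid \phi(x) = 1\}$. Clearly $F_0 \in \mathcal{F}$ and is essentially equal to $M^*$.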
Since $M^*$ is essentially equal to $M$ the Theorem is proved.

In Example (b), $X + Y$ is a complete sufficient statistic. Also for any fixed $c$,
$(X + Y, F_c)$ is equivalent to the sample $(X, Y)$. Hence it follows that every $F_c$ is
an essentially maximal ancillary. In Example (c), the ancillary $F_{n-1}$ together with
the complete sufficient statistic $X_1 + \ldots + X_n$ is equivalent to the whole sample
and, therefore, is essentially maximal. A large number of similar situations are
covered by Theorem 7.
Having partially settled the question of maximal ancillaries let us turn our
attention to the laminal ancillary.

The laminal ancillary is the largest ancillary $\sigma$-field that is included in all
maximal ancillaries. From Theorem 5 we have that the class $\Gamma_0$ of ancillary sets
$C$ that conform to every ancillary set is the laminal ancillary.

Let $\Lambda$ be the family of sets that are essentially equal to either the empty
set $\emptyset$ or the whole space $\mathcal{X}$. That is, $\Lambda$ is the family of all sets $E$ such that $P_\theta(E)$ is either
$\equiv 0$ or $\equiv 1$ for all $\theta \in \Omega$. It is easy to check that $\Lambda$ is a $\sigma$-field and that
$\Lambda \subset \Gamma_0$.


The following theorem covers a number of important cases.

Theorem 8: If the following conditions are satisfied then $\Gamma_0 = \Lambda$.

i) $\mathcal{F}$ is an essentially maximal ancillary.
ii) There exists an informative set $G$ which is independent of $\mathcal{F}$.
iii) For every $F \in \mathcal{F}$ such that $0 < P_\theta(F) < 1$ there exists $F^* \in \mathcal{F}$ such that
$P_\theta(F^*) = P_\theta(F)$ and $P_\theta(FF^*) < P_\theta(F)$.

Proof: Let $C$ be an arbitrary element of $\Gamma_0$. We have to prove that $P_\theta(C)$
$= 0$ or 1. If possible let $0 < P_\theta(C) < 1$.

Now, $\mathcal{F}$ is essentially equivalent to a maximal ancillary and $C$ belongs to every maximal
ancillary. Hence, there exists $F \in \mathcal{F}$ which is essentially equal to $C$. Thus, $F$ conforms
to every ancillary set and $0 < P_\theta(F) < 1$.

Let $G$ and $F^*$ satisfy conditions (ii) and (iii) respectively and let
$A = GF \cup G'F^*$. Since $G$ is independent of $\mathcal{F}$, we have

$$P_\theta(A) = P_\theta(G) P_\theta(F) + P_\theta(G') P_\theta(F^*) = P_\theta(F)\,[P_\theta(G) + P_\theta(G')] = P_\theta(F).$$

That is, $A$ is an ancillary set.

Now $AF = GF \cup G'(FF^*)$ and, therefore,

$$P_\theta(AF) = P_\theta(G) P_\theta(F) + P_\theta(G') P_\theta(FF^*) = P_\theta(FF^*) + P_\theta(G)\,[P_\theta(F) - P_\theta(FF^*)].$$

Let us note that $P_\theta(FF^*)$ and $P_\theta(F) - P_\theta(FF^*)$ are both independent of $\theta$ and that the
latter is not zero. Since $G$ is informative $P_\theta(G)$ is not independent of $\theta$. Hence
$AF$ is informative, which is a contradiction. Therefore, $P_\theta(C) = 0$ or 1, i.e. $C \in \Lambda$,
which proves the theorem.

If the conditions of Theorem 7 are satisfied then $\mathcal{F}$ and any informative $G \in \mathcal{G}$
satisfy conditions (i) and (ii) of Theorem 8. We have then only to check whether
condition (iii) is satisfied or not. If the restriction of $P_\theta$ to $\mathcal{F}$ be non-atomic then
it is very easy to see that condition (iii) is also satisfied.


In Examples (b) and (c) the (essentially) maximal ancillaries have continuous
(non-atomic) distributions and so Theorem 8 holds. Most of the familiar cases where
a complete sufficient statistic exists fall under the above category.


Example (f): Let $X$ be a single observation on a normal variable with mean
zero and standard deviation $\sigma$. Here $X^2$ is a complete sufficient statistic.

$$\text{Let } Y = \begin{cases} -1 & \text{if } X < 0 \\ \phantom{-}1 & \text{if } X \ge 0. \end{cases}$$

Here the pair $(Y, X^2)$ is equivalent to the whole sample $X$. Hence, by Theorem 7,
$Y$ is an essentially maximal ancillary.


The sub $\sigma$-field generated by $Y$ consists of the four sets $\emptyset$, $(-\infty, 0)$, $[0, \infty)$
and $\mathcal{X}$. Condition (iii) of Theorem 8 is clearly satisfied. Therefore, the laminal
ancillary $\Gamma_0$ is the same as $\Lambda$.
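The two properties used in Example (f) can be illustrated numerically. The following is a minimal sketch with hypothetical function names: it draws $X \sim N(0, \sigma^2)$, forms the sign statistic $Y$, and reports the share of $Y = +1$ overall and among the draws whose $X^2$ exceeds the sample median of $X^2$. Ancillarity of $Y$ suggests the first share stays near 1/2 for every $\sigma$; independence of $Y$ and $X^2$ (the Lemma) suggests the second does too.

```python
import random


def sign_statistic(x):
    """Y = -1 if X < 0 and Y = +1 if X >= 0."""
    return 1 if x >= 0 else -1


def simulate(sigma, n=100_000, seed=7):
    """Return (share of Y = +1, share of Y = +1 among large X^2)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sigma) for _ in range(n)]
    ys = [sign_statistic(x) for x in xs]
    share_plus = sum(y == 1 for y in ys) / n

    # Condition on X^2 lying above its sample median.
    median_sq = sorted(x * x for x in xs)[n // 2]
    big = [y for x, y in zip(xs, ys) if x * x > median_sq]
    share_plus_given_big = sum(y == 1 for y in big) / len(big)
    return share_plus, share_plus_given_big


for sigma in (0.5, 1.0, 4.0):
    print(sigma, simulate(sigma))
```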


ACKNOWLEDGEMENT 


I wish to thank Dr. R. R. Bahadur for some useful discussions. 


REFERENCES
BAHADUR, R. R. (1954): Sufficiency and statistical decision functions. Ann. Math. Stat., 25, 423.
BAHADUR, R. R. (1955): Statistics and subfields. Ann. Math. Stat., 26, 490.
BASU, D. (1955): On statistics independent of a complete sufficient statistic. Sankhya, 15, 377.
BASU, D. (1958): On statistics independent of a sufficient statistic. Sankhya, 20, 223.
FISHER, R. A. (1925): Theory of statistical estimation. Proc. Camb. Phil. Soc., 22, 700.
FISHER, R. A. (1956): Statistical Methods and Scientific Inference, Oliver and Boyd, London.
HOGG, R. V. and CRAIG, A. T. (1956): Sufficient statistics in elementary distribution theory. Sankhya, 17, 209.
KOLMOGOROV, A. N. (1933): Foundations of the Theory of Probability, Chelsea Publishing Company, New York.


Paper received : July, 1959. 


A LINEAR FACTOR ANALYSIS

The Israel Institute of Applied Social Research and

SUMMARY. Since the metrics of observed test scores are usually arbitrary, Thurstone posed
the problem of how to "factor" them by using only rank-order considerations. One form of solution is to
seek transformations that will yield new scores with a correlation matrix that is best from some point of
view of factor analysis. But analyzing data via their correlation matrix is justified stochastically only
if the regressions are linear. Assuming only the linearity restriction on regressions, it is shown that, in
general, at most one set of new scores can be found to maintain the observed rank-orders. The factor-analyst
has no freedom to mould the new correlation matrix by further considerations. More generally,
it is shown how to compute all scoring systems which will yield linear regressions, the new scores possibly
having polytone as well as monotone relations with the original rank order. In some cases, polytone
transformations are the more appropriate ones. Similarly, computations are outlined for all ways of
metricizing unordered data that will yield linear regressions.

1. INTRODUCTION

The modified Thurstone problem of rank-order. Thurstone (1947, pp. xiii-xiv)
posed the problem of factor analysis of observed rank-orders in terms of having
also the underlying factors only rank-ordered. A modification of this problem,
which may in some cases be equivalent to Thurstone's more general statement, is
as follows. For a set (i.e., population) $P$ of subjects and a set $J$ of tests, let $s_{jp}$
be the observed score of subject $p$ on test $j$ ($jp \in JP$).² The metrics of these observed
scores are arbitrary up to monotone transformations within each test; only the rank-order
of the $s_{jp}$ for each $j$ has fixed a priori meaning. Find those new scores $x_{jp}$ ($jp \in JP$)
which will yield a correlation matrix possessing properties that are most desirable
from some point of view of common-factor analysis, but which preserve the observed
rank-orders within tests, or

$$\operatorname{sign}(x_{jp} - x_{jq}) = \operatorname{sign}(s_{jp} - s_{jq}) \qquad (jpq \in JPP). \qquad \ldots (1.1)$$

Let $r_{jk}$ denote the ordinary product-moment correlation coefficient between
the desired new scores on tests $j$ and $k$ ($jk \in JJ$), and let $R$ denote the new correlation
matrix,

$$R = [r_{jk}] \qquad (jk \in JJ). \qquad \ldots (1.2)$$
R= [ry] ke od] 


¹ This research was facilitated by an uncommitted grant to the author from the Ford Foundation.

² By $jp$ here is meant the ordered pair of elements $j$ and $p$, or a profile over sets $J$ and $P$ in the indicated
order. By $JP$ is meant the Cartesian product of sets $J$ and $P$, or the set of all possible profiles
of the form $jp$, where $j \in J$ and $p \in P$. This type of "facet" notation and terminology is convenient
here and in other multivariate problems, making possible a compact but complete statement and
analysis of a complex [...]. $JPP$ in equation (1.1) below denotes a triple Cartesian
product [...] they are components of $jp$. Similarly, $j$, $p$, and $q$ are the
components of profile $jpq$. $J$ and $P$ are called the facets of $JP$ and $JPP$, the same $P$ serving as
two facets in the latter case.


Vol. 21] SANKHYA : THE INDIAN JOURNAL OF STATISTICS [Parts 3 & 4


The main diagonal elements of $R$ are all unity, expressing complete self-correlations,

$$r_{jj} = 1 \qquad (j \in J). \qquad \ldots (1.3)$$


Various criteria for $R$ are possible. Thurstone himself, and many others,
would prefer an $R$ whose main diagonal could be modified to yield small rank for the
reduced matrix (but keeping the Gramian property), to allow for specific and error
factors (Thurstone, 1947). Others might seek an $R$ parsimonious in a structural
sense different from that of small rank, but also possibly modifying the main diagonal
(Guttman, 1954a, 1958a, 1958b). Still others would prefer not to modify the
main diagonal, but to specify some structure for $R$ as it is.


Previous investigators have tackled at least two different aspects of the 
rank-order factoring problem. Taking minimum rank for R as a desideratum, 
Bennett (1956) has developed an interesting algorithm for a lower bound to this mini- 
mum, in terms of absence of certain rank-order patterns among the observed scores. 


Since an actual $R$ is his point of departure, Bennett's results refer to the modified
Thurstone problem.


Using another point of departure, Guttman (1946) derived heuristic equations
for direct calculation of $x_{jp}$ from the observed rank-orders, without pivoting on $R$,
in terms of a smaller set of scores (principal components) which would tend to reproduce
the observed rank-orders without assuming linear regressions. While the
smaller set appears in metric form, actually only its rank-orders matter, and this
solution may fit more closely into Thurstone's general formulation of the problem.


2. LINEARITY OF REGRESSION 


It would seem, however, that in many cases, a linear factor analysis of an
$R$ would provide the best possible answer, especially if the regressions were linear
among the scores being factored. In the case of such linear regressions, computation
of the $r_{jk}$ is stochastically justified, and $R$ represents the actual interdependence
of the new scores. Indeed, possible lack of linearity of regressions among the
observed $s_{jp}$ seems to be part of the original motivation for Thurstone to have posed his
problem.


Even more restrictive stochastic conditions would be posited by Darmois'
"general" factor analysis (Darmois, 1956) or by Lazarsfeld's (1950) latent structure
theory.


In the present paper, an analysis of Thurstone's problem will be made, using
only the stochastic condition of linearity of regression. This condition turns out
to be so restrictive, that in general at most one $R$ can be found to satisfy it and (1.1)
simultaneously. This leaves no room for considering further desiderata of schools of
factor analysis. The equations for computing the uniquely determined (up to linear
transformations) $x_{jp}$, if they exist at all, are given in §§ 4-5 below. There is also
the possibility that no satisfactory $x_{jp}$ exist, or there is no solution to Thurstone's
problem for a given set of data.


¹ The treatment (Guttman, 1946) is nominally for the case where judges do the ranking of objects;
this is formally the same as our present problem if "judges" are interpreted to be our tests $J$, and "objects"
are our subjects $P$.

Solution of this version of Thurstone's problem is simplified by first considering
and solving a more general problem. We actually study all possible transformations
of the $s_{jp}$, polytone as well as monotone. Consequently, the $s_{jp}$ need not be real
numbers at all, nor even express rank-order. They may denote arbitrary qualitative
categories.


The general problem solved in this paper is: Can real numbers be assigned
to given qualitative categories for a given population $P$ in such a way that the resulting
numerical variables will have linear regressions on each other? If yes, in how
many ways can this be done, and what are they? Should a positive answer be found
for the given data, this may justify "factor analyzing" them via a product-moment
correlation matrix. Otherwise, the qualitative or non-linear quantitative theories
of scale analysis, latent structure analysis, facet analysis, etc., are called for.


In this broader context, Thurstone's problem is the special case where the
categories are rank-orders, so that condition (1.1) can be considered as well.


3. NOTATION FOR GROUPED DATA 

Let $A_j$ be the set of all categories into which test $j$ classifies members of
$P$ ($j \in J$). A typical category (i.e., element) of $A_j$ will be denoted by $a$, and sometimes
by $b$. This "dummy" notation for categories requires reference to some set for
meaning, and such reference will always be made. Let $f_a$ be the proportion of $P$
that falls into $a$ ($a \in A_j$; $j \in J$). We shall consider only categories containing a positive
proportion of $P$, or shall assume that

$$f_a > 0 \qquad (a \in A_j;\ j \in J). \qquad \ldots (3.1)$$


We shall also assume that, for each $j$, the categories of $A_j$ are mutually exclusive
and exhaustive, so that

$$\sum_{a \in A_j} f_a = 1 \qquad (j \in J). \qquad \ldots (3.2)$$


Let $m$ be the total number of tests, i.e., the number of elements of $J$. Let
$m_j$ be the number of categories in $A_j$ ($j \in J$). For Thurstone's problem, for each $j$
the elements of $A_j$ can be simply the integers from 1 to $m_j$, or, alternatively, the
midpoints of $m_j$ class-intervals for the $s_{jp}$. Tied ranks are explicitly allowed within
a test, and $m_j$ may vary from test to test, features which are quite prevalent in
practice. More generally, no a priori order need exist among the elements of $A_j$ for
any $j$; they may be arbitrary qualities.

While we assume that $m$ and each of the $m_j$ are finite, no assumption will be
made about the size of population $P$: it may be finite or infinite. Stochastically,
the analysis makes most sense if $P$ has an infinite number of members, for we shall
not discuss sampling error due to selecting a subset of subjects from a larger population.
Proportions such as $f_a$ will be treated as if they had no sampling error. As
far as the pure algebra goes, however, $P$ below could be finite. The reader may
make his own interpretation; this will not affect the formulae themselves.


To say that subject $p$ has observed value $s_{jp}$ on test $j$ is to say that $s_{jp} = a$,
where $a$ is some category of $A_j$. Indeed, for fixed $j$, $f_a$ is simply the relative frequency
over $P$ with which the equality $s_{jp} = a$ holds. Similarly, to say that subject $p$ receives
new score $x_{jp}$ on test $j$ is to say that, for this $j$, numerical value $y_a$ is assigned to $a$, and
$x_{jp} = y_a$ whenever $s_{jp} = a$. To talk in terms of $s_{jp}$ and $x_{jp}$ is to talk about the "ungrouped"
data, while categories $a$ and scores $y_a$ permit treatment of the data in
"grouped" form.


A special case of "grouping" of course is where $P$ is finite, with $n$ members,
and $f_a \equiv 1/n$, or each category of test $j$ contains only one subject. This includes
the case of untied rank-orders.


The correlation coefficients in $R$ will not change if means and variances of
the $x_{jp}$ are changed. So there is no loss in generality in setting the means equal
to 0 and the variances equal to 1. In grouped data form, this implies

$$\sum_{a \in A_j} f_a y_a = 0 \qquad (j \in J) \qquad \ldots (3.3)$$

and

$$\sum_{a \in A_j} f_a y_a^2 = 1 \qquad (j \in J). \qquad \ldots (3.4)$$


For fixed $jk$, let $f_{ab}$ be the joint relative frequency of categories $a$ and $b$
($ab \in A_j A_k$). As usual, marginal frequencies are obtainable by summing over joint
frequencies, or

$$f_a = \sum_{b \in A_k} f_{ab} \qquad (a \in A_j;\ jk \in JJ). \qquad \ldots (3.5)$$


By the usual product-moment formula for grouped data, recalling (3.3) and (3.4),

$$r_{jk} = \sum_{ab \in A_j A_k} f_{ab}\, y_a y_b \qquad (jk \in JJ). \qquad \ldots (3.6)$$


Clearly, if we consider two categories from the same test, or $j = k$,

$$f_{ab} = \begin{cases} f_a & \text{if } a = b \\ 0 & \text{if } a \ne b \end{cases} \qquad (ab \in A_j A_j;\ j \in J), \qquad \ldots (3.7)$$

so the right member of (3.6) equals the left member of (3.4) when $j = k$, verifying
(1.3).

Notice that for given $jk$, where $j \ne k$, (3.6) shows $r_{jk}$ is a bilinear form over
the $f_{ab}$.
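Formulae (3.3) to (3.6) translate directly into a few lines of code. The sketch below is illustrative: the 2 x 2 table of joint relative frequencies and the raw category scores are made up, and the function names are hypothetical. Raw scores are first rescaled to satisfy (3.3) and (3.4), and $r_{jk}$ is then the weighted sum of products in (3.6).

```python
def marginals(f):
    """Row and column marginal frequencies of a joint table f[a][b], as in (3.5)."""
    f_a = [sum(row) for row in f]
    f_b = [sum(row[b] for row in f) for b in range(len(f[0]))]
    return f_a, f_b


def standardize(scores, weights):
    """Rescale scores to weighted mean 0 and variance 1, as in (3.3)-(3.4)."""
    mean = sum(w * s for w, s in zip(weights, scores))
    var = sum(w * (s - mean) ** 2 for w, s in zip(weights, scores))
    return [(s - mean) / var ** 0.5 for s in scores]


def grouped_correlation(f, row_scores, col_scores):
    """Product-moment correlation (3.6) from joint relative frequencies."""
    f_a, f_b = marginals(f)
    y_a = standardize(row_scores, f_a)
    y_b = standardize(col_scores, f_b)
    return sum(
        f[a][b] * y_a[a] * y_b[b]
        for a in range(len(f_a))
        for b in range(len(f_b))
    )


# Hypothetical joint relative frequencies for two dichotomous tests.
TABLE = [[0.30, 0.10],
         [0.20, 0.40]]

print(grouped_correlation(TABLE, [0, 1], [0, 1]))
# A test against itself, as in (3.7): the joint table is diagonal.
print(grouped_correlation([[0.4, 0.0], [0.0, 0.6]], [1, 2], [1, 2]))
```

Because the scores are standardized inside the function, any increasing linear change of the raw scores leaves the result unchanged, which is the invariance noted in the text.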
4. THE BIVARIATE CONTINGENCY MATRICES


For the bivariate case $m = 2$, the basic formulae and results to be presented
now (in §§ 4-6) have been rediscovered independently by various writers during the
past two decades¹ (cf. Bennett, 1956; Burt, 1950; Guttman, 1941, 1953a and 1953b;
Hirschfeld, 1935; Maung, 1942; Williams, 1952). Despite their relative simplicity
and their importance in a wide variety of situations, such formulae have not yet
attracted the general attention they seem to deserve. Our present task is merely
to extend them to the multivariate case. But even when $m > 2$, the bivariate
regressions must be linear as well as the multiple regressions. A great deal of our
work is accomplished by considering first all the bivariate regressions among the $m$
tests.

For each pair of tests $jk$, the $f_{ab}$ represent a contingency table, or matrix, of
$m_j$ rows and $m_k$ columns. The elements of $A_j$ are the row captions, and the elements
of $A_k$ are the column captions. The problem is to replace these captions by real
numbers, $y_a$ and $y_b$ ($ab \in A_j A_k$), so that the resulting numerical regressions will be
linear.

Linearity of regression of the $x_{jp}$ on the $x_{kp}$, say, is defined by considering,
for each $b$ in turn ($b \in A_k$), the arithmetic mean of the $x_{jp}$ for all $p$ whose $x_{kp}$ equals
$y_b$. These conditional means must lie on a straight line when the $y_b$ are abscissas,
with slope $r_{jk}$. (The slope is more generally the regression coefficient, but this equals
$r_{jk}$ when variances are set equal to unity, as we have assumed). Expressing the regression
of the $x_{jp}$ on the $x_{kp}$ in grouped data form, the linearity condition is

$$r_{jk}\, y_b = \frac{1}{f_b} \sum_{a \in A_j} y_a f_{ab} \qquad (b \in A_k;\ jk \in JJ). \qquad \ldots (4.1)$$


The left member of (4.1) is the regression estimate of $x_{jp}$ as a linear function of
$x_{kp} = y_b$, while the right member is the direct statement of the conditional mean of
the $x_{jp}$ for fixed $b$ (or $y_b$).

The converse condition for linearity of regression, for that of the $x_{kp}$ on the
$x_{jp}$, is, analogous to (4.1),

$$r_{jk}\, y_a = \frac{1}{f_a} \sum_{b \in A_k} y_b f_{ab} \qquad (a \in A_j;\ jk \in JJ). \qquad \ldots (4.2)$$


Because of the "dummy" nature of the notation, (4.2) is strictly equivalent to (4.1).
However, it is convenient nevertheless to be able to refer to both forms.


¹ I am indebted to Dr. William Kruskal for supplying me with three of these references.
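Condition (4.1) can be checked numerically for any proposed scoring. The sketch below is illustrative only (the tables and the integer scores are hypothetical, as are the function names): it standardizes the scores, computes $r_{jk}$ by (3.6), and reports the largest gap between the two members of (4.1) over the columns $b$. A gap of zero means the regression of rows on columns is linear for that scoring.

```python
def marginals(f):
    """Row and column marginal frequencies of a joint table f[a][b]."""
    f_a = [sum(row) for row in f]
    f_b = [sum(row[b] for row in f) for b in range(len(f[0]))]
    return f_a, f_b


def standardize(scores, weights):
    """Rescale scores to weighted mean 0 and variance 1, as in (3.3)-(3.4)."""
    mean = sum(w * s for w, s in zip(weights, scores))
    var = sum(w * (s - mean) ** 2 for w, s in zip(weights, scores))
    return [(s - mean) / var ** 0.5 for s in scores]


def max_departure_from_linearity(f, row_scores, col_scores):
    """Largest absolute gap between the two members of (4.1) over columns b."""
    f_a, f_b = marginals(f)
    rows, cols = range(len(f_a)), range(len(f_b))
    y_a = standardize(row_scores, f_a)
    y_b = standardize(col_scores, f_b)
    r = sum(f[a][b] * y_a[a] * y_b[b] for a in rows for b in cols)
    gaps = []
    for b in cols:
        conditional_mean = sum(y_a[a] * f[a][b] for a in rows) / f_b[b]
        gaps.append(abs(r * y_b[b] - conditional_mean))
    return max(gaps)


# With two categories per test any scoring gives a linear regression.
TWO_BY_TWO = [[0.30, 0.10],
              [0.20, 0.40]]
# With three categories, equally spaced scores need not do so.
THREE_BY_THREE = [[0.2, 0.0, 0.1],
                  [0.0, 0.3, 0.0],
                  [0.1, 0.0, 0.3]]

print(max_departure_from_linearity(TWO_BY_TWO, [0, 1], [0, 1]))
print(max_departure_from_linearity(THREE_BY_THREE, [1, 2, 3], [1, 2, 3]))
```

The converse condition (4.2) can be checked with the same function by transposing the table and exchanging the two score lists.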
5. THE NUMBER OF SOLUTIONS IN THE BIVARIATE CASE


Multiplying (4.1) through by $f_b y_b$, summing over $b \in A_k$, and recalling (3.4)
serve to verify (3.6). Similar verification comes from multiplying (4.2) through by
$f_a y_a$ and summing over $a \in A_j$. Indeed, for fixed $jk$, (4.1) and (4.2) are the stationary
equations for determining the $y_a$ and $y_b$ that will maximize (minimize) $r_{jk}$ in (3.6), subject
to restraint (3.4). This has been discovered by several of the writers cited above,
by differentiating the right member of (3.6) subject to condition (3.4). Further
known theorems are as follows.


For fixed $jk$, the number of linearly independent solutions to (4.1) and (4.2)
is always equal to the rank of the matrix $[f_{ab}]$. All but one of these solutions will also
satisfy (3.3). The one improper solution is always $y_a \equiv y_b \equiv 1$, for which $r_{jk} = 1$.
Thus, the number of linearly independent proper solutions is one less than the rank
of $[f_{ab}]$. To eliminate the one improper solution, define $\bar{f}_{ab}$ by

$$\bar{f}_{ab} = f_{ab} - f_a f_b \qquad (ab \in A_j A_k;\ jk \in JJ), \qquad \ldots (5.1)$$

and use $\bar{f}_{ab}$ in place of $f_{ab}$ in (4.1) and (4.2). The rank of $[\bar{f}_{ab}]$ is one less than the rank
of $[f_{ab}]$, and every solution of the new equations will be a solution of the old equations,
and will also satisfy (3.3). $\bar{f}_{ab}$ is simply $f_{ab}$ with "chance expectation" removed as in
computations for the chi-square test of significance for a contingency table.


6. RELATION TO CHI-SQUARE 


There is an intimate relation with the entire chi-square theory. For fixed $jk$, let $p_{jk}$ be the rank of $[f'_{ab}]$, so that $p_{jk} + 1$ is the rank of $[f_{ab}]$. Let $y_a^p$ and $y_b^p$ be the $p$-th proper solution to (4.1) and (4.2), with resulting correlation $r_{jk}^p$ $(p = 1, 2, \ldots, p_{jk})$. Then $r_{jk}^p \neq 0$ always, and

$$\sum_{p=1}^{p_{jk}} (r_{jk}^p)^2 = \phi_{jk}^2 \qquad (jk \in JJ), \qquad (6.1)$$

where $\phi_{jk}^2$ is Karl Pearson's mean square contingency coefficient

$$\phi_{jk}^2 = \sum_{a \in A_j} \sum_{b \in A_k} \frac{(f'_{ab})^2}{f_a f_b} \qquad (jk \in JJ). \qquad (6.2)$$

If $P$ were finite, say with $n$ members, and if $\chi_{jk}^2$ were computed in the usual fashion for the contingency matrix $[f_{ab}]$ between tests $j$ and $k$, then $\chi_{jk}^2 = n\phi_{jk}^2$ $(jk \in JJ)$.

This helps explain why $\chi_{jk}^2$ is often not a very sensitive test for statistical independence; it crudely averages contributions of $p_{jk}$ possibly different sources of dependence without regard to the possible structure of dependence (Guttman, 1953b and Williams, 1952), that is, without specifying alternatives to the null hypothesis of independence as is required, say, in the Neyman-Pearson theory of testing hypotheses.

While the different solutions may not actually be statistically independent, and hence do not necessarily constitute [...], they are always mutually orthogonal, or linearly uncorrelated, if their latent roots are all unequal. More explicitly, for fixed $jk$, let $y_a^p$ and $y_a^{p'}$ be the respective scores given to $a$ $(a \in A_j)$ by solutions $p$ and $p'$, yielding correlations $r_{jk}^p$ and $r_{jk}^{p'}$ respectively. If these two correlation coefficients are unequal in absolute value, which is the general case when $p \neq p'$, then

$$\sum_{a \in A_j} f_a\, y_a^p\, y_a^{p'} = 0 \qquad (p \neq p'). \qquad (6.3)$$

The above known facts will now be used to obtain new results, relevant to our present problems.
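The relation between the mean square contingency of (6.2) and the usual chi-square can be checked numerically. The following is only a minimal sketch: the function name and the 2 x 3 table of counts are made up for illustration and do not come from the paper.

```python
def mean_square_contingency(counts):
    """Return (phi2, chi2, n) for a two-way table of observed counts."""
    n = sum(sum(row) for row in counts)
    f = [[c / n for c in row] for row in counts]      # proportions f_ab
    f_a = [sum(row) for row in f]                      # marginals of test j
    f_b = [sum(col) for col in zip(*f)]                # marginals of test k
    cells = [(a, b) for a in range(len(f_a)) for b in range(len(f_b))]
    # (6.2): squared residuals f'_ab = f_ab - f_a f_b, scaled by f_a f_b
    phi2 = sum((f[a][b] - f_a[a] * f_b[b]) ** 2 / (f_a[a] * f_b[b])
               for a, b in cells)
    # chi-square in the usual fashion: (observed - expected)^2 / expected
    chi2 = sum((counts[a][b] - n * f_a[a] * f_b[b]) ** 2
               / (n * f_a[a] * f_b[b]) for a, b in cells)
    return phi2, chi2, n


counts = [[20, 10, 5], [5, 20, 40]]    # hypothetical 2 x 3 table
phi2, chi2, n = mean_square_contingency(counts)
print(phi2, chi2 / n)                  # the two values agree up to rounding
```

Because one of the two tests in this illustrative table is a dichotomy, there is a single proper solution, so by (6.1) the value of `phi2` is also the square of the one non-trivial correlation.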


7. NUMBER OF SOLUTIONS MAINTAINING THE OBSERVED RANK-ORDER 


How to solve (4.1) and (4.2) for fixed $jk$ is well known. For each $p$, an iterative procedure can be used to go from (4.1) to (4.2) and back again; or else (4.1) can be substituted in (4.2) to eliminate the $y_b$, and then iterations can be performed for the $y_a$ alone from the resulting equations.
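The back-and-forth iteration can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of any of the cited writers: it works on the residuals $f'_{ab} = f_{ab} - f_a f_b$ so that the improper solution is excluded, it rescales the scores to unit weighted variance at each step, and it finds only the solution with the largest correlation. The function name and the table of proportions are hypothetical.

```python
import math


def linearizing_scores(f, iters=200):
    """Alternate between (4.1) and (4.2) on the residual table f'_ab."""
    rows, cols = range(len(f)), range(len(f[0]))
    f_a = [sum(f[a][b] for b in cols) for a in rows]
    f_b = [sum(f[a][b] for a in rows) for b in cols]
    # remove "chance expectation" so that y = 1 is no longer a solution
    g = [[f[a][b] - f_a[a] * f_b[b] for b in cols] for a in rows]
    y_a = [float(a) for a in rows]           # arbitrary non-constant start
    r = 0.0
    for _ in range(iters):
        # (4.1): conditional means of y_a for each b, rescaled to unit variance
        y_b = [sum(y_a[a] * g[a][b] for a in rows) / f_b[b] for b in cols]
        scale = math.sqrt(sum(f_b[b] * y_b[b] ** 2 for b in cols))
        y_b = [v / scale for v in y_b]
        # (4.2): conditional means of y_b for each a; their scale estimates r
        y_a = [sum(y_b[b] * g[a][b] for b in cols) / f_a[a] for a in rows]
        r = math.sqrt(sum(f_a[a] * y_a[a] ** 2 for a in rows))
        y_a = [v / r for v in y_a]
    return y_a, y_b, r


f = [[0.20, 0.10, 0.05],
     [0.05, 0.20, 0.40]]                     # hypothetical proportions, sum 1
y_a, y_b, r = linearizing_scores(f)
```

The sketch assumes the table is not exactly independent (otherwise the rescaling divides by zero) and that all marginal proportions are positive. Further solutions, when they exist, would need the earlier ones removed first; that step is not shown.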


Now, $p_{jk} + 1$ cannot exceed the smaller of $m_j$ or $m_k$, since the rank of a matrix cannot exceed either its row or its column order. So the smaller of $m_j - 1$ and $m_k - 1$ is an upper bound to the number of solutions to the bivariate case. In the multivariate case, let $m_0$ be the smallest of the $m_j$ $(j \in J)$. Then $m_0 - 1$ is an upper bound to the number of different ways in which regressions of the $s_j$ can be linearized.

If $m_0 = 2$, or at least one of the tests provides only a dichotomous classification for $P$, then there is at most one solution to Thurstone's problem, as well as to the more general problem of metricizing unordered qualitative data. Only if $m_0 > 2$ is there room for more than one solution.

If $m_0 > 2$ and $m = 2$, $p$ can possibly take on more than one value; nevertheless, Thurstone's problem again can have but one solution in general. In general at most one of the linearized regression systems will satisfy (1.1). To prove this, again let $y_a^p$ and $y_a^{p'}$ be the respective scores given to $a$ $(a \in A_j)$ by solutions $p$ and $p'$, and let $\theta_j$ be defined by

$$\theta_j = \sum_{ab \in A_j A_j} f_a f_b\, (y_a^p - y_b^p)(y_a^{p'} - y_b^{p'}). \qquad (7.1)$$

Notice that both $a$ and $b$ in (7.1) are elements of the same $A_j$.

Now, if $a$ and $b$ are distinct categories of $A_j$ that maintain the original rank-order (1.1), whether scored by $p$ or by $p'$, then

$$(y_a^p - y_b^p)(y_a^{p'} - y_b^{p'}) > 0 \qquad (a \neq b,\ ab \in A_j A_j,\ j \in J). \qquad (7.2)$$

From (3.1), (7.2), and (7.1), we obtain

$$\theta_j > 0 \qquad (j \in J). \qquad (7.3)$$

But expanding the right member of (7.1) with the aid of (3.2), (3.3), and (6.3) yields, in general,

$$\theta_j = 0 \qquad (j \in J), \qquad (7.4)$$

which contradicts (7.3). Therefore (7.2) cannot hold in general. No two distinct solutions can maintain the same rank-order (apart from the exceptional case of equal latent roots) for categories of the same test.
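The expansion leading to (7.4) can be written out in a few lines. This is a sketch under assumptions: it takes (7.1) to be the weighted double sum $\theta_j = \sum_a \sum_b f_a f_b (y_a^p - y_b^p)(y_a^{p'} - y_b^{p'})$ over pairs of categories of the same test, and it assumes that the marginal proportions sum to one and that every proper solution has zero weighted mean, $\sum_a f_a y_a^p = 0$, which is how conditions (3.2) and (3.3) appear to be used at this point.

```latex
\begin{align*}
\theta_j
  &= \sum_{a \in A_j} \sum_{b \in A_j} f_a f_b\,
     (y_a^p - y_b^p)(y_a^{p'} - y_b^{p'}) \\
  &= 2 \Big(\sum_{a} f_a\, y_a^p\, y_a^{p'}\Big) \Big(\sum_{b} f_b\Big)
     - 2 \Big(\sum_{a} f_a\, y_a^p\Big) \Big(\sum_{b} f_b\, y_b^{p'}\Big) \\
  &= 2 \cdot 0 \cdot 1 - 2 \cdot 0 \cdot 0 = 0 .
\end{align*}
```

The first product vanishes by the orthogonality (6.3), the second by the zero weighted means, so a strictly positive $\theta_j$ is impossible whenever the two latent roots differ.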


If there is a regression linearizing solution to Thurstone's problem, it is unique in general. In particular, if the original $s_j$ already have linear regressions, there is no profit in general in trying to transform them into new scores if condition (1.1) is to be satisfied.


Interesting empirical examples have been reported in (Guttman, 1953b), where condition (1.1) was deliberately abandoned to obtain psychologically more appropriate polytone transformations of rank-orders. For fixed $jk$, these linearized the regressions to maximize $r_{jk}$. This type of situation may prove to be fairly frequent with attitudinal data, should investigators begin to look into their results for possible polytone properties.


8. THE MULTIVARIATE CASE 


Thus far, we have focussed on solutions to (4.1) and (4.2) for fixed $jk$. But there is no guarantee that if scores $y_a$ $(a \in A_j)$ are a solution for the joint distribution of test $j$ with test $k$, they will remain a solution for test $j$ with test $i$ where $k \neq i$. For the case $m > 2$, the $y_a$ $(a \in A_j)$ must linearize regressions simultaneously with all other tests if they are to be part of a solution for the multivariate case. If no scores $y_a$ $(a \in A_j)$ do such a job with all tests, then there is no solution at all to the regression linearizing problem, whether or not condition (1.1) is to be considered, and whether or not the data are initially unordered.

A way of testing the simultaneous linearizing property is to compute the $y_a$ and $y_b$ $(ab \in A_j A_k)$ for some fixed $jk$, and then compute $y_c$ $(c \in A_i)$ from each of the fixed scores by using the latter in the right of (4.1) and (4.2) respectively, but writing $c$ and $r_{ji}$ in place of $b$ and $r_{jk}$ in (4.1) and writing $c$ and $r_{ik}$ in place of $a$ and $r_{jk}$ in (4.2). The resulting left members should be numerically equal for all $i \in J$.

The above is also a way of avoiding having to solve (4.1) and (4.2) repeatedly for various $jk$. Once solved for fixed $jk$, solutions for other pairs of tests can be got directly from these if the simultaneous linearizing property holds.

But even all this concerns only bivariate regressions, and gives no assurance about linearity of multiple regressions. For example, if all test scores were dichotomous ($m_j = 2$, $j \in J$), simultaneous linearizing always holds, since a straight line can always be fitted to two points; there is always one and only one $R$ (apart from reflections of sign) that is proper from the bivariate distributions, namely that of point correlations from fourfold tables. But this $R$ says little or nothing in general about the shape of the multiple regressions among the dichotomies.

Study of the multiple regressions cannot, of course, introduce new solutions beyond those obtained by the bivariate considerations of (4.1) and (4.2). What it can do is to eliminate some or all of the latter. For example, the unique solutions to (4.1) and (4.2) for dichotomies need not yield linear multiple regressions. In such a case, a "factor analysis" of the $R$ of point correlations is inappropriate; nonlinear techniques are needed that go beyond $R$.


9. FORMULAE FOR THE MULTIPLE REGRESSIONS 


To state the multiple regression conditions the $y_a$ $(a \in A_j,\ j \in J)$ must satisfy beyond (4.1) and (4.2), more notation is needed. Let $J_{jk}$ be the subset of $J$ defined by omitting tests $j$ and $k$, where $j \neq k$. Let $C_{jk}$ be the Cartesian product of the sets of categories of the tests in $J_{jk}$,

$$C_{jk} = \prod_{i \in J_{jk}} A_i \qquad (jk \in JJ), \qquad (9.1)$$

and let $c$ be a typical element of $C_{jk}$, i.e., a profile over all tests except $j$ and $k$. Let $f_{abc}$ be the proportion of joint occurrences of $abc$ $(abc \in A_j A_k C_{jk},\ jk \in JJ)$. Then the bivariate contingencies are given by

$$f_{ab} = \sum_{c \in C_{jk}} f_{abc} \qquad (ab \in A_j A_k,\ jk \in JJ). \qquad (9.2)$$

Similarly, if $f_{bc}$ is the proportion of joint occurrences of $bc$, then

$$f_{bc} = \sum_{a \in A_j} f_{abc} \qquad (bc \in A_k C_{jk},\ jk \in JJ). \qquad (9.3)$$

Let $\bar{y}_{bc}$ be the predicted value of $y_a$ on the $j$-th test for a subject whose profile on the $m-1$ remaining tests is $bc$, according to some metricization of the original scores $(abc \in A_j A_k C_{jk})$. Then for the profiles for which the members of (9.3) do not vanish,

$$\bar{y}_{bc} = \frac{1}{f_{bc}} \sum_{a \in A_j} y_a f_{abc} \qquad (bc \in A_k C_{jk},\ jk \in JJ), \qquad (9.4)$$

or $\bar{y}_{bc}$ is the conditional mean of the $y_a$ for the given profile. Thus (9.4) defines the true regressions, whether or not they are linear.

If $\hat{y}_{bc}$ denotes the estimate of $y_a$ as a linear function of the scores on the remaining tests $(abc \in A_j A_k C_{jk})$, it can be expressed as follows. Let

$$\delta_{abc} = \begin{cases} 1 & \text{if } a \text{ is a component of } bc \\ 0 & \text{otherwise} \end{cases} \qquad (abc \in A_i A_k C_{jk},\ ijk \in JJJ). \qquad (9.5)$$

Clearly, if $i = j$ or $j = k$, $\delta = 0$ in (9.5), for $c$ has no component from test $j$ by definition, and $C_{jk}$ is not defined if $j = k$. Let $w_{jk}$ be the multiple regression coefficient of test $k$ in the linear regression of test $j$ on the remaining $m-1$ tests $(jk \in JJ)$. As usual, $w_{jj} = 0$, for a test is not used to predict itself. The $w_{jk}$ can be computed directly from $R$, and most simply from $R^{-1}$ when $R$ is non-singular. If $R$ is singular, the $w_{jk}$ are not uniquely determined, but, as is well known, the predictions themselves are invariant with respect to choice of equally good $w_{jk}$ and can be stated as

$$\hat{y}_{bc} = \sum_{i \in J} w_{ji} \sum_{a \in A_i} y_a \delta_{abc} \qquad (bc \in A_k C_{jk},\ jk \in JJ). \qquad (9.6)$$

In the right member of (9.6), the summation over $a \in A_i$ picks out the scores of the predicting tests which are associated with the profile $bc$, and then the summation over $i \in J$ simply weights these by the regression coefficients and adds the products.
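For a non-singular $R$, the regression weights can be read off the inverse correlation matrix by the standard identity $w_{jk} = -\rho^{jk}/\rho^{jj}$, where $\rho^{jk}$ denotes an element of $R^{-1}$. The sketch below illustrates that identity only; the helper names and the 3 x 3 correlation matrix are made up, and the small Gauss-Jordan inverse is included just to keep the example self-contained.

```python
def invert(m):
    """Gauss-Jordan inverse of a small non-singular square matrix."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        # partial pivoting: bring the largest remaining entry to the diagonal
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def regression_weights(R):
    """w[j][k]: weight of test k when test j is predicted from all the others."""
    Rinv = invert(R)
    n = len(R)
    return [[0.0 if j == k else -Rinv[j][k] / Rinv[j][j] for k in range(n)]
            for j in range(n)]


R = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.4],
     [0.3, 0.4, 1.0]]        # hypothetical correlation matrix
W = regression_weights(R)
```

Each row of `W` satisfies the normal equations of the corresponding regression, which is what the accompanying check verifies; the diagonal is zero because a test is not used to predict itself.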


The linearity requirement for multiple regression is that $\bar{y}_{bc} = \hat{y}_{bc}$, or from (9.4) and (9.6),

$$\sum_{a \in A_j} y_a f_{abc} = f_{bc} \sum_{i \in J} w_{ji} \sum_{a \in A_i} y_a \delta_{abc} \qquad (bc \in A_k C_{jk},\ jk \in JJ). \qquad (9.7)$$

In the form (9.7), we need not worry about the vanishing of members of (9.3), for then both members of (9.7) vanish by virtue of (9.3) and non-negativeness of proportions.

To check (9.7) with empirical data may usually be prohibitive. As a partial check which may be serviceable in practice, one can use a necessary condition derived from (9.7) and (4.1). Let

$$f_{iab} = \sum_{c \in C_{jk}} f_{bc}\, \delta_{abc} \qquad (ab \in A_i A_k,\ ijk \in JJJ). \qquad (9.8)$$

Then summing (9.7) over $c \in C_{jk}$ and using (9.2), (4.1), and (9.8) yield

$$r_{jk}\, f_b\, y_b = \sum_{i \in J} w_{ji} \sum_{a \in A_i} y_a f_{iab} \qquad (b \in A_k,\ jk \in JJ). \qquad (9.9)$$

To use (9.9), first employ (4.1) and (4.2) for determining the $y_a$ for each variable, compute the resulting $R$ and $w_{jk}$, tabulate the $f_{iab}$, and see if all these hang together as required by (9.9). If (9.9) is not satisfied, the multiple regressions are not linear for this metricization.

Even weaker, but computationally more feasible, necessary conditions can be derived by summing the respective members of (9.9) over $j$ or in other manners.


10. RELATION TO OTHER STOCHASTIC CONSIDERATIONS


While many, if not most, of the theorems of the various forms of linear factor analysis are non-statistical, general theorems for abstract Euclidean vector spaces, their use in the behavioral sciences requires stochastic interpretation. The substantive interest is in actual interdependencies, not in linear algebra.

To what extent is interdependence expressed by regression equations, linear or otherwise? If correlation ratios were used instead of correlation coefficients, would these always tell the whole story?

The importance of considering the nature of the true regressions when computing product-moment correlations is illustrated by the fact that zero correlation does not necessarily imply statistical independence. Consider a perfect but nonlinear regression of one variable on another for which the correlation ratio is 1 while the linear correlation coefficient is 0. The perfect dependence of one variable on the other is obscured by the linear analysis.

The answer to the second question is: No. A zero correlation ratio shows only that the conditional means of one variable on another do not vary. But while the means are constant, dependence may occur in the form of heteroscedasticity, varying skewnesses, etc.

A complete analysis should account for all forms of dependence. This is the motivation of both Darmois (1956) and Lazarsfeld (1950) in their respective approaches.

Our present analysis shows that the criterion of linear regressions is so restrictive by itself that it is not very hopeful that other criteria could often be satisfied as well in metricizing data. The multivariate normal distribution is exceptional in that it is so well behaved: zero linear correlation implies complete statistical independence.
If metricizing cannot lead to a multivariate normal distribution, the next best thing for many purposes is to obtain at least a well-defined regression system. How to seek the simplest linear system has been the topic of this paper.

Some writers seem to imply that if the $s_j$ are transformed so as to make their marginal distributions normal, then the new bivariate or multivariate distributions will be normal. Unfortunately, this is not necessarily the case, for marginal transformations can often say little about linearity or non-linearity of the resulting regressions. This is one of the possible fallacies in using the tetrachoric coefficient (cf. Guttman, 1950a). If marginal transformations could do the trick, Thurstone's problem would not have arisen, and the algebra above would be unnecessary.

From another point of view, our analysis does account completely for the dependence of the tests on each other. For fixed $jk$, if we consider all of the $p_{jk}$ solutions to (4.1) and (4.2), and not select merely one of them, they completely account for the $f_{ab}$, for as several of our references have essentially shown (cf. Hirshfeld, 1935; Guttman, 1941; 1950b; Maung, 1942; Williams, 1952) it is always true that

$$\frac{f_{ab}}{f_a f_b} = 1 + \sum_{p=1}^{p_{jk}} r_{jk}^p\, y_a^p\, y_b^p \qquad (ab \in A_j A_k,\ jk \in JJ). \qquad (10.1)$$

The complete sets of regression linearizing scores always completely reproduce the observed pairwise occurrences, or completely "explain" bivariate dependence. To have multivariate dependence also accounted for this way would require that the pairwise solutions for $jk$ also hold for $ij$ and $ik$ $(i \neq j, k)$, etc. This last condition is of course only necessary, and not sufficient.
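Identity (10.1) is easy to confirm in the simplest case of a fourfold table, where there is a single proper solution and $r_{jk}$ is the point correlation. The sketch below is an illustration only; the table and the helper name are made up, and the scores are simply the standardized scores of each dichotomy.

```python
import math


def point_scores(p):
    """Zero-mean, unit-variance scores for a dichotomy with proportions (1-p, p)."""
    return [-math.sqrt(p / (1 - p)), math.sqrt((1 - p) / p)]


f = [[0.4, 0.1],
     [0.2, 0.3]]                            # hypothetical fourfold table
f_a = [sum(row) for row in f]
f_b = [sum(col) for col in zip(*f)]
y_a = point_scores(f_a[1])
y_b = point_scores(f_b[1])
# point correlation: r = sum over cells of y_a * y_b * f_ab
r = sum(y_a[a] * y_b[b] * f[a][b] for a in range(2) for b in range(2))
# right member of (10.1) with one proper solution, multiplied by f_a f_b
rebuilt = [[f_a[a] * f_b[b] * (1 + r * y_a[a] * y_b[b]) for b in range(2)]
           for a in range(2)]
print(rebuilt)                              # reproduces f up to rounding
```

With more than two categories per test, further terms of the sum in (10.1) would be needed, one for each proper solution.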
The theorem of § 7 above states that in general no two metricizations of the categories of a test can maintain the same rank-order for given $j$. This means that the $y_a^p$ are in general a polytone function of the $y_a^{p'}$ $(a \in A_j)$ when $p \neq p'$, and conversely. To have but one of the $p_{jk}$ solutions be basic in accounting for dependence would require all the others to be orderly polytone functions of it. An excellent example of a law of formation of polytone dependence which singles out one solution as basic occurs in the theory of perfect scales (Guttman, 1950b, 19541).


This paradox of polytone relations among regression linearizing solutions seems worth exploring in more complex situations of rank-order. It has an important bearing not only on the chi-square theory of statistical dependence, but also on the structural problems with which factor analysis is concerned.


POSITIVE AND NEGATIVE DEPENDENCE OF TWO
RANDOM VARIABLES


By H. S. KONIJN
University of Sydney, Australia


SUMMARY. Two random variables will be called completely positively dependent if there is an almost sure nondecreasing relation between them, and positively $\kappa$-dependent if their joint distribution is a mixture (with mixture coefficient $\kappa$) of the distribution of two completely positively dependent random variables and the distribution of two independent random variables with the same marginals. Similar definitions can be given for complete negative dependence and negative $\kappa$-dependence. The paper discusses properties of such variates and properties of the power of various tests for independence against such types of dependence.

1. INTRODUCTION

In another paper (Konijn, 1956), the author has investigated the asymptotic power of various tests for independence of two random variables against the alternative that their joint distribution has been obtained by a nontrivial linear transformation of independent random variables and has given some justification for considering such a type of alternatives. Among the many other types one could consider, it seems natural to examine one under which the joint distribution belongs to a family of bivariate distributions, all with the same marginals, which, as an extreme, contains one implying a monotone relation between the variates. In fact, for any two given marginals, there is a unique joint distribution which implies a nondecreasing relation between the variates, and a unique joint distribution which implies a nonincreasing relation.

We shall use what is mathematically perhaps the simplest method of constructing such families of distributions, by linearly combining such an extreme and the distribution corresponding to independence with the same marginals.

It is true that the bivariate distributions belonging to such families (other than those corresponding to the case of independence) all have a singular component, which seems rather unusual. Nevertheless, it may well be that a number of situations of practical interest can be represented approximately by a distribution of this family with the singular component replaced by a narrow band of probability mass about the corresponding curve in the plane of the two variates.

2. DEFINITION AND CHIEF PROPERTIES OF $F_+$ AND $F_-$

If $Y_0$ and $Z_0$ are two independent random variables with distributions $G$ and $H$, we shall denote the distribution of $X_0 = (Y_0, Z_0)$ by $F_0 = GH$. By $F_+$ [$F_-$] we shall denote the uniformly highest [lowest] valued bivariate distribution function with marginals $G$ and $H$; that these exist and are given by

$$F_+(y, z) = \begin{cases} G(y) & \text{if } G(y) \le H(z), \\ H(z) & \text{if } G(y) > H(z), \end{cases}$$

$$F_-(y, z) = \begin{cases} 0 & \text{if } G(y) + H(z) - 1 \le 0, \\ G(y) + H(z) - 1 & \text{if } G(y) + H(z) - 1 > 0, \end{cases}$$

was shown by Fréchet (1951). In fact, these expressions are easily seen to bound any bivariate distribution function with given marginals $G$ and $H$ from above and below respectively, and to be distribution functions.
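The bounding property is simple to check numerically. The sketch below evaluates the two extreme distribution functions at given marginal values $g = G(y)$ and $h = H(z)$ and confirms, on a grid, that the independent case $F_0 = GH$ lies between them. The function names, the grid and the small rounding tolerance are choices made for this illustration only.

```python
def frechet_upper(g, h):
    """F_+ evaluated at marginal values g = G(y), h = H(z)."""
    return min(g, h)


def frechet_lower(g, h):
    """F_- evaluated at marginal values g = G(y), h = H(z)."""
    return max(0.0, g + h - 1.0)


EPS = 1e-12                                  # tolerance for rounding error
grid = [i / 20 for i in range(21)]
# independence F_0 = GH must lie between the two bounds at every grid point
ok = all(frechet_lower(g, h) - EPS <= g * h <= frechet_upper(g, h) + EPS
         for g in grid for h in grid)
```

The lower inequality is just $(1-g)(1-h) \ge 0$ rearranged, and the upper one holds because $g$ and $h$ both lie in the unit interval.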


Definition: Let $S$ be the unit square bounded below and to the left by the coordinate axes; then the line segment connecting $(0, 0)$ with $(1, 1)$ is its principal diagonal, the one connecting $(0, 1)$ with $(1, 0)$ its secondary diagonal. Let

$$(t) = \begin{cases} 0 & \text{if } t < 0, \\ t & \text{if } 0 \le t \le 1, \\ 1 & \text{if } t > 1. \end{cases}$$

Let

$$R_0(v, w) = (v)(w),$$
$$R_+(v, w) = \min\{(v), (w)\},$$
$$R_-(v, w) = \max\{0, (v) + (w) - 1\}.$$

Let $D$ be a class of disjoint open intervals on the abscissa of $S$ and disjoint open intervals on the ordinate of $S$. For any point $t$ on either axis, let $t^D$ equal the lower limit of that interval (if any) of $D$ to which $t$ belongs, and equal to $t$ otherwise. Finally, let

$$R_0^D(v, w) = (v^D)(w^D),$$
$$R_+^D(v, w) = \min\{(v^D), (w^D)\},$$
$$R_-^D(v, w) = \max\{0, (v^D) + (w^D) - 1\}.$$

Evidently $R_0$ is the uniform distribution over $S$, and $R_+$ and $R_-$ are the uniformly highest and lowest valued bivariate distribution functions over $S$ whose marginals are uniform. We have, moreover,

Lemma 2.1: $R_+$ [$R_-$] is that distribution function for which all the probability mass is concentrated along the principal [secondary] diagonal of the unit square $S$ and is spread uniformly along that diagonal. $R_+^D$ [$R_-^D$] is that distribution function for which all the probability mass is concentrated along the principal [secondary] diagonal of the unit square $S$ and is spread uniformly along that diagonal, except that sections of the diagonal whose projections are contained in $D$ have all the mass accumulated at that end point which is projected to the upper limit of the relevant interval of $D$.

For $Y$ and $Z$ having continuous distributions $G$ and $H$ it is well known that

$$P\{G(Y) \le t\} = P\{H(Z) \le t\} = (t).$$

If $G$ or $H$ have discontinuities, the ranges of $G$ or $H$ are not [...] a class $D$ of intervals and

$$P\{G(Y) \le t\} = P\{H(Z) \le t\} = (t^D).$$

It follows that, if $X_0$ has the distribution $F_0$, $X_+$ the distribution $F_+$ and $X_-$ the distribution $F_-$,

$$U_0 = (V_0, W_0) = (G(Y_0), H(Z_0)),$$
$$U_+ = (V_+, W_+) = (G(Y_+), H(Z_+)),$$
$$U_- = (V_-, W_-) = (G(Y_-), H(Z_-)),$$

have the distributions $R_0^D$, $R_+^D$ and $R_-^D$. In general, for $X = (Y, Z)$ with distribution $F$, we shall designate the distribution of the couple $U = (V, W) = (G(Y), H(Z))$ by $R$.

The following theorem identifies $F_+$ [$F_-$] with the completely positively [negatively] dependent distributions with marginals $G$ and $H$ and, with the results of the next section, the continuous [...]. We shall say that $r$ constitutes a nondecreasing [nonincreasing] relation between the points of two intervals if, for either interval, $r$ associates with any point of it at least one point of the other, none of which is smaller [larger] than any point it associates with a smaller point of the former interval; $r$ is increasing [decreasing] if it is also one-to-one.

Theorem 2.1: All the mass of the $F_+$ [$F_-$] distribution is concentrated along a nondecreasing [nonincreasing] (possibly discontinuous) curve, that is, there is an almost sure nondecreasing relation between $Y_+$ and $Z_+$ [nonincreasing relation between $Y_-$ and $Z_-$]. Conversely, if between two random variables with marginal distributions $G$ and $H$ there is an almost sure nondecreasing [nonincreasing] relation, their joint distribution is $F_+$ [$F_-$]. If $F_0$ is continuous, the curve is strictly¹ monotone and the relation one-to-one, and conversely.

Proof: Let $S_0$ be the union of all rectangles $S'$ in the $(y, z)$-plane such that if $(y, z)$ and $(y', z') \in S'$ with $y < y'$, $z < z'$, then $G(y) = G(y')$ or $H(z) = H(z')$. Then the joint distribution of any random variables with marginals $G$ and $H$ assigns no probability mass to $S_0$, and outside of $S_0$ the functions $G$ and $H$ are still distribution functions and have unique inverses wherever some inverse exists.

Now if $X_+$ has the distribution $F_+$, $U_+$ has the distribution $R_+^D$. So by Lemma 2.1 the locus of the probability mass of the $F_+$ distribution is a nondecreasing (strictly increasing and continuous outside $S_0$ if $F_0$ is continuous) image of the principal diagonal of $S$, having no mass in $S_0$.

Conversely, if between $Y$ and $Z$ there is an almost sure monotone nondecreasing interval-valued relation $Z = f(Y)$, we have outside a set of $t$ of zero $G$-measure

$$G(t) = P\{Y \le t\} = P\{Z \le \bar{f}(t)\} = H(\bar{f}(t)),$$

where $\bar{f}(t)$ is the upper limit of the values of $f(t)$. Therefore,

$$H(Z) = H(\bar{f}(Y)) = G(Y)$$

almost surely,

$$P\{G(Y) \le v,\ H(Z) \le w\} = P\{G(Y) \le v,\ G(Y) \le w\} = \min\{(v^D), (w^D)\} = P\{G(Y_+) \le v,\ H(Z_+) \le w\},$$

and outside of $S_0$, $P\{Y \le y,\ Z \le z\} = \min\{G(y), H(z)\}$. By definition of $S_0$, this relation must be preserved in $S_0$. If $f$ is one-to-one, we have outside $S_0$

$$f = H^{-1}G, \qquad f^{-1} = G^{-1}H,$$

with $H^{-1}$ and $G^{-1}$ free of jumps. So $G$ and $H$ have no jumps outside $S_0$, implying that $F_0$ is continuous.

The proof for $F_-$ and nonincreasing relations is similar.


1 The beginning and end points of intervals for which the curve contains no probability mass may, however, have the same $y$ or $z$ coordinates.


3. CONTINUITY PROPERTIES 


In Theorem 2.1 we obtained sharper results for the case in which $F_0$ is continuous than for the general case. Continuity is also important if one wants to construct nonparametric tests of independence. We state without proof:

Theorem 3.1: If $F_0$ is continuous, $F_+$ and $F_-$ are also continuous.

Theorem 3.2: Let $X = (Y, Z)$ have the continuous distribution $F$ with marginals $G$ and $H$. Then (a) $G$ and $H$ are continuous, and (b) the distribution $R$ of $U = (G(Y), H(Z))$ is continuous.


4. SOME PARAMETERS AND THEIR ESTIMATES

For any continuous bivariate distribution $F$ with marginals $G$ and $H$, with $\mu$ and $\nu$ medians of $G$ and $H$ respectively, we define

$$\tau_1 = 4\Big\{\iint F(y, z)\,dF(y, z) - \iint F_0(y, z)\,dF_0(y, z)\Big\},$$

$$\tau_2 = 12 \iint \Big(G(y) - \int G\,dG\Big)\Big(H(z) - \int H\,dH\Big)\,dF(y, z),$$

$$\tau_3 = 4\{F(\mu, \nu) - G(\mu)H(\nu)\}.$$

If the denominator is finite (it is positive for continuous $F$) we define $\rho$ by

$$\rho = \frac{\iint \big(y - \int y\,dG\big)\big(z - \int z\,dH\big)\,dF(y, z)}{\Big[\int \big(y - \int y\,dG\big)^2 dG(y) \int \big(z - \int z\,dH\big)^2 dH(z)\Big]^{1/2}}.$$

$\rho$ is the (ordinary) correlation, $\tau_1$ the difference sign correlation, $\tau_2$ the grade correlation, $\tau_3$ the medial correlation. The "natural" unbiased estimates of $\tau_1$ and $\tau_2$ based on a sample of $n$ independent pairs of observations are given in [...] (1948), the "natural" consistent estimate of $\tau_3$ in Blomqvist (1950). To save space they will not be reproduced here; we shall denote them by $T_{in}$ $(i = 1, 2, 3)$, and the sample correlation coefficient by $R_n$. The expectation of Spearman's rank correlation coefficient, $\{3\tau_1 + (n-2)\tau_2\}/(n+1)$, will be denoted by $\tau_{0n}$.

Let us first prove the following lemma, which may be of more general interest. (Professor Hoeffding kindly pointed out that for $F$ absolutely continuous these relations are essentially contained in his doctoral dissertation.)

Lemma 4.1: Let $F$ be a bivariate distribution function with marginal distribution functions $G$ and $H$. Then

(i) $\iint F\,dG\,dH = \iint GH\,dF$ if $G$ and $H$ are continuous,

(ii) $\iint \big(y - \int y\,dG\big)\big(z - \int z\,dH\big)\,dF = \iint (F - GH)\,dy\,dz$ if the left-hand side, or $\int y\,dG$ and $\int z\,dH$, exist.

Proof: Let $V$ and $W$ be continuous distribution functions. Then

$$\iint V(y)W(z)\,dF(y, z) = \int W(z)\,d_z\!\int V(y)\,d_yF(y, z).$$

Now

$$\int V(y)\,d_yF(y, z) = V(y)F(y, z)\Big|_{y=-\infty}^{y=\infty} - \int F(y, z)\,dV(y) = H(z) - \int F(y, z)\,dV(y),$$

so that

$$\iint V(y)W(z)\,dF(y, z) = 1 - \int G(y)\,dV(y) - \int H(z)\,dW(z) + \iint F(y, z)\,dV(y)\,dW(z).$$

For $V = G$, $W = H$, this reduces to (i), since $\int G\,dG = \int H\,dH = 1/2$. To obtain (ii), let $h$ and $k$ be positive numbers, let

$$y_{hk} = \begin{cases} -h & \text{if } y \le -h, \\ y & \text{if } -h < y < k, \\ k & \text{if } y \ge k, \end{cases}$$

and define $z_{hk}$ similarly. Let $V_{hk}(y) = (y_{hk} + h)/(h + k)$, $W_{hk}(z) = (z_{hk} + h)/(h + k)$. Let
Let 


ов = (В | | Vary) Waled Fly, 2) 


Cu = (4-09 | | ( Vu). 


kk 
din = (a+r)? f | F(y, 2dV ly) dW (г) = { f Ply, z)dydz, 


-h =h 


бв = (h+k) | У), 


he = 0+0 | ( Vati) j^ ) aay), 


Ль = 0+0) | тан), 


fi = + | (Wae) Ê ) ame, 


; 
te = (+k) | ват) = | бу, 
-h 


" 
hin = (hk) [ Hed Wyle) = | наг. 


E 


The previously obtained result then becomes 
Wor) Жы сы ‘ ` 
бы = J^ — Ju Ты du. As ер = 2—9 and / = j— Спе fak = dy — gui Pnr 


Movesar, Сё = Ci — Crt frt 80 сы eji = Яда. The limit of the left- 
hand side as h and k — co is, by definition of the Lebesgue-Stieltjes integral 
3 


(writing j = h + k) 


$$J = \iint yz\,dF(y, z) - \int y\,dG(y) \int z\,dH(z)$$

and it exists by the hypothesis of (ii). Therefore the limit of the right-hand side,

$$\iint \{F(y, z) - G(y)H(z)\}\,dy\,dz,$$

exists and equals $J$.

The following theorem yields classes within which the statistics $T_{in}$ $(i = 0, 1, 2)$ and $R_n$ yield consistent tests of independence:

Theorem 4.1: Let $F$ be continuous,

$$\iint_{F - F_0 < 0} dF = 0 \quad \text{or} \quad \iint_{F - F_0 < 0} dF_0 = 0,$$

and $F > F_0$ for some $x$. Then $\tau_i[F] > 0$ for $i = 0, 1, 2$; and if $\iint y^2\,dF < \infty$ and $\iint z^2\,dF < \infty$, $\rho[F] > 0$.

Proof: We first show that the two first conditions imply that $F - F_0 \ge 0$ everywhere. Suppose there would exist $y_0, z_0$ such that $F(y_0, z_0) - F_0(y_0, z_0) < 0$. Let $y', z'$ be the smallest numbers exceeding or equal to $y_0$ and $z_0$ respectively such that $(y', z')$ is a point of increase of $F$; then there would exist $y'', z''$ exceeding $y'$ and $z'$ respectively such that $F - F_0 < 0$ throughout $y_0 \le y \le y''$, $z_0 \le z \le z''$, and since this rectangle contains a point of increase of $F$ and $F - F_0$ is continuous, this would contradict the hypothesis that $\iint_{F - F_0 < 0} dF = 0$. On the other hand, let $\bar{y}, \bar{z}$ be the largest numbers not exceeding $y_0, z_0$ respectively such that $(\bar{y}, \bar{z})$ is a point of increase of $F_0$; then there would exist $\underline{y}, \underline{z}$ less than $\bar{y}, \bar{z}$ respectively such that $F - F_0 < 0$ throughout $\underline{y} \le y \le y_0$, $\underline{z} \le z \le z_0$, and this would contradict the hypothesis that $\iint_{F - F_0 < 0} dF_0 = 0$.

Now,

$$\tau_1 = 4\iint (F\,dF - F_0\,dF_0) = 4\iint (F - F_0)\,dF_0 + 4\iint (F - F_0)\,dF \ge 4\iint (F - F_0)\,dF_0 = \tfrac{1}{3}\tau_2,$$

and

$$\tau_2[F] = 12\int_0^1\!\!\int_0^1 (R(v, w) - vw)\,dv\,dw$$

has an integrand which is positive at at least one point and continuous, so that the integral is positive. Similarly

$$\operatorname{cov}(Y, Z) = \iint (F - F_0)\,dy\,dz$$

is positive.


5. THE VALUES OF THE PARAMETERS FOR POSITIVELY AND NEGATIVELY
$\kappa$-DEPENDENT VARIABLES


Definition: For $0 < \kappa \le 1$ let $F_\kappa^+ = (1 - \kappa)F_0 + \kappa F_+$, $F_\kappa^- = (1 - \kappa)F_0 + \kappa F_-$.

Let $K[F_0]$ be the class of distributions obtainable from $F_0$ by all such mixtures. In the case of one-sided alternatives we can consider the class obtainable by mixtures of $F_0$ with $F_+$ or with $F_-$; one may then, if desired, distinguish notationally $K^+[F_0]$ and $K^-[F_0]$.

Theorem 5.1: Under $F_\kappa^+$ [$F_\kappa^-$] (assuming existence and positiveness of second moments about the means)

$$0 < \rho(Y, Z) \le \kappa \qquad [-\kappa \le \rho(Y, Z) < 0];$$

the equality is only reached in case of a linear relation.

Proof: If, for example, $X$ has the distribution $F_\kappa^+$,

$$\operatorname{cov}\{Y, Z\} = EYZ - EY\,EZ = (1 - \kappa)\iint yz\,dF_0 + \kappa\iint yz\,dF_+ - \int y\,dG \int z\,dH$$
$$= \kappa\Big(\iint yz\,dF_+ - \iint yz\,dF_0\Big) = \kappa \operatorname{cov}\{Y_+, Z_+\} > 0.$$

The last conclusion of the theorem is well known (see Cramér (1946) p. 265).
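We now proceed to compute the values of the $\tau_i$ under $F_\kappa^+$ and $F_\kappa^-$, assumed continuous. Using Lemma 2.1 we get

$$\iint F_+\,dF_+ = \iint R_+\,dR_+ = \int_0^1 v\,dv = \tfrac{1}{2},$$

since $R_+(v, w) = v$ along the diagonal $v = w$;

$$\iint F_0\,dF_0 = \iint R_0\,dR_0 = \int_0^1\!\!\int_0^1 vw\,dv\,dw = \tfrac{1}{4},$$

$$\iint F_0\,dF_+ = \iint R_0\,dR_+ = \int_0^1 v^2\,dv = \tfrac{1}{3},$$

since $R_0(v, w) = v^2$ along the diagonal $v = w$;

$$\iint F_-\,dF_- = \iint R_-\,dR_- = \int_0^1 0\,dv = 0$$

(where $w = 1 - v$), since $R_-(v, w) = 0$ along the diagonal $v = 1 - w$;

$$\iint F_0\,dF_- = \iint R_0\,dR_- = \int_0^1 v(1 - v)\,dv = \tfrac{1}{6}$$

(where $w = 1 - v$), since $R_0(v, w) = vw$ equals $v(1 - v)$ along the diagonal $v = 1 - w$.

Consequently,

$$\iint F_\kappa^+\,dF_\kappa^+ = (\kappa^2 + 2\kappa + 3)/12, \qquad \iint F_\kappa^-\,dF_\kappa^- = (-\kappa^2 - 2\kappa + 3)/12,$$

$$\iint F_\kappa^+\,dF_0 = \iint F_0\,dF_\kappa^+ = (\kappa + 3)/12, \qquad \iint F_\kappa^-\,dF_0 = \iint F_0\,dF_\kappa^- = (-\kappa + 3)/12,$$

so, in obvious notation,

$$\tau_1^+(\kappa) = (\kappa^2 + 2\kappa)/3, \qquad \tau_1^-(\kappa) = -(\kappa^2 + 2\kappa)/3,$$

$$\tau_2^+(\kappa) = \kappa, \qquad \tau_2^-(\kappa) = -\kappa.$$

Let $\mu$ be a median of $G$; then, since $G$ is continuous, $G(\mu) = 1/2$, and, as $P\{G(Y_0) \le 1/2\} = 1/2$, $1/2$ is the median of the distribution of $G(Y_0)$. Similarly, if $\nu$ is a median of $H$, $H(\nu) = 1/2$, and $1/2$ is the median of the distribution of $H(Z_0)$. Therefore,

$$F_\kappa^+(\mu, \nu) = R_\kappa^+(\tfrac{1}{2}, \tfrac{1}{2}) = (1 - \kappa)\tfrac{1}{4} + \kappa \cdot \tfrac{1}{2},$$

$$F_\kappa^-(\mu, \nu) = R_\kappa^-(\tfrac{1}{2}, \tfrac{1}{2}) = (1 - \kappa)\tfrac{1}{4} + \kappa \cdot 0,$$

so $\tau_3^+(\kappa) = \kappa$, $\tau_3^-(\kappa) = -\kappa$.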
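These values can be checked by simulation. The sketch below draws pairs with uniform marginals from the mixture $(1-\kappa)R_0 + \kappa R_+$ and estimates the three parameters directly from their definitions, using the representation $\tau_1 = 2\Pr\{(Y_1 - Y_2)(Z_1 - Z_2) > 0\} - 1$ for the difference sign correlation. It is an illustration only: the function names, the sample size, the seed and the choice $\kappa = 0.5$ are arbitrary, and the estimators are simple Monte Carlo averages, not the statistics $T_{in}$ of the paper.

```python
import random


def sample_positive_kappa(kappa, n, rng):
    """Draw n pairs (V, W) with uniform marginals from (1-kappa) R_0 + kappa R_+."""
    pairs = []
    for _ in range(n):
        v = rng.random()
        # with probability kappa the pair sits on the principal diagonal
        w = v if rng.random() < kappa else rng.random()
        pairs.append((v, w))
    return pairs


def estimate_taus(pairs):
    """Monte Carlo estimates of tau_1, tau_2, tau_3 for uniform marginals."""
    n = len(pairs)
    half = n // 2
    # tau_1 = 2 P{(Y1 - Y2)(Z1 - Z2) > 0} - 1, from disjoint pairs of points
    concordant = sum(1 for (v1, w1), (v2, w2)
                     in zip(pairs[:half], pairs[half:2 * half])
                     if (v1 - v2) * (w1 - w2) > 0)
    tau1 = 2 * concordant / half - 1
    # tau_2 = 12 E[(G(Y) - 1/2)(H(Z) - 1/2)]
    tau2 = 12 * sum((v - 0.5) * (w - 0.5) for v, w in pairs) / n
    # tau_3 = 4 {F(mu, nu) - 1/4}, the medians being 1/2 here
    tau3 = 4 * (sum(1 for v, w in pairs if v <= 0.5 and w <= 0.5) / n - 0.25)
    return tau1, tau2, tau3


kappa = 0.5
rng = random.Random(0)
tau1, tau2, tau3 = estimate_taus(sample_positive_kappa(kappa, 200000, rng))
print(tau1, (kappa ** 2 + 2 * kappa) / 3)   # estimate vs. derived value
print(tau2, tau3, kappa)
```

The estimates should fall close to $(\kappa^2 + 2\kappa)/3$, $\kappa$ and $\kappa$ respectively, up to sampling error.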


Under F*[F-], assumed continuous, т; [—7;] equals (к24-2к)/3 


Theorem 5.2: 
Consequently, for F continuous and i = 0, 1, 2, 3, 


for i = 1, and к for i = 2 or З. 
T;— 1 0?" Т; = —1 


if the variates have am almost sure increasing or decreasing relation. The converse 
holds for à — 0, 1 and 2, but not for i = 3. 
the converse for i= 1, merely observe! that for (Y,, Ү,) and 


To show 
ted with common continuous distribution F 


(Zi Zə) independently distribu 
n LF] = 2Pr. (Y1 YZ) > 0—1. 


To show the converse for i = 2 note that 


„Ру? Р-Р» = ij ja, w)—vw}dedw 


mized for R= R,, that is, for F = F, by the definition of F,; and if № 


is maxi 
1*1 

ximizes | [085 w)—vw)dvdw, 
0 


‚ 
R = R, almost everywhere, so everywhere 


also ma 
(since continuous). The proof for тЇ] ——1 is similar. One sees easily that 
4-1 (and also т] = + 1 if n> 2). 


TF] = +1 implies ni] = 


on to Professor W. Hoeffding. 
$\tau_3[F] = \pm 1$ does not imply an almost sure monotone relation between $Y$ and
$Z$. For it is easy to visualize an almost sure non-monotone relation between $Y$ and
$Z$, the graph of which lies entirely in the two positive quadrants defined by the medians
of $Y$ and $Z$ so that $\tau_3[F] = 1$.

As a contrast to these results $\rho[F] = \pm 1$ if and only if all the mass of the
$F$ distribution lies along a straight line (assuming existence of second moments).

To the above corollary corresponds the following obvious property of the
statistics $T$:


Theorem 5.3: Let $F_\pm$ be continuous. For $i = 0, 1, 2, 3$, $T_{in}$ equals 1 under
$F_+$ and $-1$ under $F_-$ almost surely.


6. UNBIASEDNESS, BOUNDS ON THE POWER AND ASYMPTOTIC POWER 
AGAINST ALTERNATIVES IN K 


Theorem 6.1: Let Fo be continuous, i = 0, 1, 2, 3, а = (a', a^), 


EC = (6, же a) : (2, а Lp) < а! or > а}, 


k n 
PRS} = ГГ I аР даз) | | dP (x), 
S8in(@) j= jekii 


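and let $t'_{in}$ be the largest and $t''_{in}$ the smallest number for which $0 < \alpha'_{in} = P\{T_{in} \le t'_{in} \mid F_0\}$
and $0 < \alpha''_{in} = P\{T_{in} \ge t''_{in} \mid F_0\}$ do not exceed $\alpha' > 0$ and $\alpha'' > 0$ respectively,
but converge to these numbers. Here $\alpha'_{in} + \alpha''_{in} = \alpha_{in}$ and $\alpha' + \alpha'' = \alpha < 1/2$. Denote
the power of the $(\alpha', \alpha'')$-level independence test $\mathcal{T}_i$ (based on $T_{in}$) against $F_\kappa^+$ by
$\beta_n(\mathcal{T}_i \mid F_\kappa^+)$, $\kappa > 0$. Define similarly the test against $F_\kappa^-$ and its power. Then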


6) AIF) = Q—9* aay bey" (ру кб кз pp {Six бый 


(with the same holding for superscripts —); 


(b) 1—(1—k)"(1—a,,) 2 PF P) in K'(1—9.) (with Pa = Рі or Рх) 


which exceeds o, for all continuous F, so that the `2, 


are unbiased against alternatives 
т K[GH] with G,H continuous; : 


(c) the first inequality above is sharp for all but the lowest n: 


(d) the derivative of $\beta_n(\mathcal{T}_i \mid F_\kappa)$ with respect to $\kappa$ at $\kappa = 0$ equals $n\epsilon_i$, where
$\epsilon_i > 0$ and $F_\kappa = F_\kappa^+$ or $F_\kappa^-$;
(e) the same results hold for the correlation test if and only if there exist numbers
$a > 0$ and $b$ such that $G(y) = H(ay+b)$ [under $F_\kappa^-$: $1 - G(y) = H(-ay+b)$] for almost
all $y$. (In this case the critical values depend on the particular $F_0$);

(f) the $T_{in}$ are unbiased estimates of the $\tau_i$ for $i = 0, 1, 2$; and, under $F_0$,
$ET_{in} = 0$ for $i = 0, 1, 2$, and 3.
Proof: (a) The proof follows by expansion of 


n 


pef Пат) 


Sin(tin) =1 


noting that, by Theorem 5.3, Рі, {Silt = 1. 


(b) and (c) follow from the fact that the $P_k\{S_{in}(t_{in})\}$ are nondecreasing functions
of $k$, and strictly increasing ones, (b) for small $k$ (including at least $k = 1$), and
(c) (for all but the lowest $n$) for $k$ near $n$ (including at least $k = n-1$).

(d) follows from the fact cited in the proof of (b).

(e) and (f) are immediate.


We now examine the asymptotic power against positive $\kappa'$ near 0. Let
$\Phi(x) = (2\pi)^{-1/2}\int_{-\infty}^{x} \exp(-t^2/2)\,dt$, $\Phi(\delta') = \alpha'$, $1-\Phi(\delta'') = \alpha''$. From the results of Hoeffding
(1948) and Blomqvist (1950) and using the linearity in $\kappa$ of $F_\kappa^+$ and $F_\kappa^-$, it follows
easily in this case that $T_{in}$ is asymptotically normal (nonsingular except at $\kappa = 1$).
For $R_n$, when the second moments are finite, we note that $ER_n = \rho + O(1/n)$,
$E(R_n - \rho)^2 = k/n + O(n^{-3/2})$, and the asymptotic distribution of $(R_n - \rho)(k/n)^{-1/2}$ is normal
with zero mean and unit variance if the fourth moments are finite [Cramér, 1946,
p. 359 and p. 366], where $k$ is a certain expression approximately equal to 1 when $\kappa$
is small. Moreover, the moments are polynomials in $\kappa$, and by Theorem 4.1 $\rho$ is
positive under $F_\kappa^+$ and negative under $F_\kappa^-$. This gives the following result, in which
independence from $F_0$ should be noted.


Theorem 6.2: Consider $(\alpha', \alpha'')$-level tests for independence based on $T_{in}$
($i = 0, 1, 2, 3$), and (if $G$ and $H$ have finite fourth moments) on $R_n$; and let $F_0$ be
continuous. For alternatives in $K^+[F_0]$ and $K^-[F_0]$ for which $\kappa'$ is near 0, their asymptotic
power is

$$1 - \Phi(\delta'' - \sqrt{n}\,\kappa') + \Phi(\delta' - \sqrt{n}\,\kappa'),$$

whatever be $F_0$.
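As a rough numerical illustration of the asymptotic power expression in Theorem 6.2, the following sketch evaluates $1 - \Phi(\delta'' - \sqrt{n}\,\kappa') + \Phi(\delta' - \sqrt{n}\,\kappa')$ with the standard normal distribution from the Python standard library. The function name and argument names are hypothetical; $\delta'$ and $\delta''$ are obtained from $\Phi(\delta') = \alpha'$ and $1 - \Phi(\delta'') = \alpha''$.

```python
from math import sqrt
from statistics import NormalDist


def asymptotic_power(n, kappa, alpha_lower, alpha_upper):
    """Evaluate 1 - Phi(delta'' - sqrt(n) kappa) + Phi(delta' - sqrt(n) kappa)."""
    std = NormalDist()  # standard normal, mean 0 and unit variance
    delta_lower = std.inv_cdf(alpha_lower)        # Phi(delta') = alpha'
    delta_upper = std.inv_cdf(1.0 - alpha_upper)  # 1 - Phi(delta'') = alpha''
    shift = sqrt(n) * kappa
    return 1.0 - std.cdf(delta_upper - shift) + std.cdf(delta_lower - shift)


if __name__ == "__main__":
    for kappa in (0.0, 0.1, 0.2, 0.3):
        print(kappa, round(asymptotic_power(100, kappa, 0.025, 0.025), 4))
```

At $\kappa' = 0$ the expression reduces to $\alpha' + \alpha''$, the level of the test, which gives a quick sanity check on any implementation.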
DEFINITION AND USE OF GENERALIZED
PERCENTAGE POINTS 


By JOHN E. WALSH 
Lockheed Aircraft Corporation, California* 


SUMMARY. Often what is assumed to be a random sample is in reality a set of independent
observations each of which is from a separate statistical population, where some of these populations are
noticeably different from the others. Also, situations where the observations are known to be from noticeably
different populations frequently arise. Then “population percentage point” represents an undefined
concept. This introduces the problem of generalizing the population percentage point concept to the situation
of a set of independent observations from possibly different populations. The resulting generalized
percentage point parameter should represent an intuitively understandable “average” of the percentage
point properties of the populations involved. Also reasonably accurate point estimation, confidence
intervals and significance tests should be available for this generalized percentage point parameter on
the basis of the observations. This paper presents a generalized percentage point concept which satisfies
these requirements for situations where the populations are continuous and do not differ greatly. Approximate
methods are outlined for determining median estimates and confidence intervals for the values of
generalized percentage points of the type presented.


1. INTRODUCTION 


Statistical analysis based on a sample usually is much less difficult than a
similar analysis based on independent observations from possibly different statistical
populations. Consequently, there is a tendency to assume that a set of independent
observations represents a sample, even when rough approximation to this situation is
doubtful. Then a population percentage point investigation may not be meaningful
since “population percentage point” is undefined if the observations are not from the
same population. The purpose of this paper is to introduce a generalized percentage
point concept which is applicable when the observations are from noticeably different
populations. These generalized percentage points appear to be useful when
the populations are different and reduce to the corresponding population percentage
points when a single population is involved. Use of this generalized percentage point
concept, combined with the application procedures developed in this paper, protects
against the erroneous assumption of a random sample but incurs little penalty if a
random sample actually exists.

Let us consider some conditions which a generalized percentage point might
be expected to satisfy when the data consist of a set of $n$ independent observations
from possibly different populations. These conditions are:

(1) A generalized percentage point $\theta_p$ is completely identified by its definition
and the value of $p$.

* Now with the System Development Corporation, Santa Monica, California, U.S.A.
(2) The value of $\theta_p$ depends on all of the $n$ populations from which observations
were drawn and represents some sort of continuous average of
percentage point properties of these populations. This “average” should
be directly related to the value of $p$ and capable of intuitive interpretation.

(3) If all the observations are from the same population, $\theta_p$ equals the $100p$
percent point for that population.

(4) Reasonably accurate point estimation of $\theta_p$ should be possible on the
basis of the $n$ observations.

(5) Reasonably accurate confidence intervals and significance tests for $\theta_p$
should be available on the basis of the $n$ observations.

It is hardly to be expected that any formulated definition of $\theta_p$ has the property
that (1)-(5) are satisfied for all possible sets of $n$ statistical populations. Conditions
(4) and (5) limit the allowable situations. However, a $\theta_p$ definition can be obtained
with the property that conditions (1)-(5) hold for a rather general class of sets of
$n$ statistical populations.

Let us consider the restrictions on sets of $n$ populations which are adopted in
this paper. Stated in a qualitative manner, these restrictions are:

(a) All the populations are continuous.

(b) Let $x_i$ be the observation from the $i$-th population, while
$p_i = \Pr(x_i < \theta_p)$, $(i = 1, \ldots, n)$. The variation among the $p_i$ is required
not to be too large.


Restriction (a) is not completely necessary. That is, the $\theta_p$ concept presented
can satisfy all of (1)-(5) for situations where one or more of the populations are not
continuous. However, the analysis is greatly simplified if this restriction is adopted;
moreover, many applied situations are such that (a) is acceptable. Condition (b)
is stated in rather general terms. A technical specification of what is meant by the
requirement that the variation among the $p_i$ is not too large is given in the Definitions
and Results section of this paper.

The quantity $\theta_p$ presented in this paper is not necessarily unique. That
is, there could be a range of values, all of which satisfy the specified requirements for
$\theta_p$. This could happen, for example, if all the populations have probability density
functions with ranges of zero values which are both preceded and followed by ranges
of non-zero values. This lack of uniqueness property of $\theta_p$ causes no difficulties in the
analysis; moreover, lack of uniqueness can occur even if all the populations are the
same.

The next section is titled Definitions and Results. This section contains the
definition for $\theta_p$ and a technical statement of restriction (b). A method for approximate
median estimation of $\theta_p$ and a procedure for obtaining approximate confidence
intervals for $\theta_p$ are also given. Significance tests can be obtained from these confidence
intervals in the usual manner and are not considered explicitly. The final
section is titled Verification and contains an outline of the basis for the results
presented.


2. DEFINITIONS AND RESULTS 


Let us consider a set of $n$ independent observations whose values are denoted
by $x_1, \ldots, x_n$. Each observed value can occur from a possibly different statistical
population, where these populations are unknown but fixed (i.e., not a sample from a
universe of populations). The parameter $\theta_p$ is defined as follows.

Definition of $\theta_p$: The quantity $\theta_p$ is the value, or set of values, which satisfies
the relation

$$\frac{1}{n}\sum_{i=1}^{n} p_i = \frac{1}{n}\sum_{i=1}^{n} \Pr(x_i < \theta_p) = p.$$

The quantity $\theta_p$ has an uncomplicated intuitive interpretation. Namely, $\theta_p$ is the
$100p\%$ point of the statistical population whose probability distribution is the
arithmetic average of the distributions for the $n$ populations considered. It is easily
verified that the $\theta_p$ concept defined here satisfies conditions (1)-(3).
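The definition can be made concrete with a small numerical sketch. It assumes, purely for illustration, that the $n$ population distribution functions are known (in the paper they are unknown but fixed) and locates $\theta_p$ by bisection on the averaged distribution function. The function name `generalized_percentage_point` and the normal populations used in the demonstration are hypothetical choices, not part of the paper.

```python
from statistics import NormalDist


def generalized_percentage_point(cdfs, p, lo, hi, tol=1e-10):
    """Solve (1/n) * sum_i F_i(theta) = p for theta by bisection on [lo, hi]."""
    def average_cdf(x):
        return sum(cdf(x) for cdf in cdfs) / len(cdfs)

    if not average_cdf(lo) < p < average_cdf(hi):
        raise ValueError("the bracket [lo, hi] must contain the solution")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if average_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Two noticeably different continuous populations (illustrative choice).
    populations = [NormalDist(0.0, 1.0).cdf, NormalDist(1.0, 1.0).cdf]
    print(generalized_percentage_point(populations, 0.5, -10.0, 10.0))
    # Identical populations: theta_p coincides with the ordinary 100p% point.
    same = [NormalDist(0.0, 1.0).cdf] * 5
    print(generalized_percentage_point(same, 0.3, -10.0, 10.0))
```

When all populations coincide the result reduces to the usual percentage point, in line with condition (3).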

The point estimate and confidence interval results presented in this paper
are based on the assumption that each $x_i$ is from a continuous population and that the
variation among the values of the $p_i$ is not too large. The allowable variation restriction
is used in deriving confidence intervals for $\theta_p$ and depends on the particular
confidence interval considered. Let $y(1), \ldots, y(n)$ be the values of the $x_i$ arranged
according to increasing algebraic value. Thus $y(1)$ is the smallest of the $x_i$ while $y(n)$
is the largest. Then a confidence interval of the type considered is based on an
arbitrary pair of the quantities $y(0), y(1), \ldots, y(n), y(n+1)$; here $y(0) = -\infty$ and
$y(n+1) = \infty$.

Suppose that $y(n_1)$ and $y(n_2+1)$ are the two quantities used, where
$0 \le n_1 \le n_2 \le n$ and that the combination $y(0)$, $y(n+1)$ is excluded. Using $q = 1-p$,
let $C_v(n_1, n_2)$ equal


in (v, nj--j —1) n—v 
5 min (v, +7 ار‎ ( ne qn—v—nj—j--k4-1 
m,4-j—k—1 
Gal в max (l, ов LM 
and

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(p_i - p)^2.$$

Let $\sigma_U$ be an upper bound for the value of $\sigma$. Then restriction (b), the requirement
that the variation among the $p_i$ is not too large, can be expressed as follows.
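Technical Statement of (b): The value of $\sigma_U$ is such that

$$n\sigma_U^2\,|C_2(n_1, n_2)| < \min\left[\sum_{r=n_1}^{n_2}\binom{n}{r}p^r q^{n-r},\ 1 - \sum_{r=n_1}^{n_2}\binom{n}{r}p^r q^{n-r}\right]$$

and either or both of

$$\frac{1}{8}\,n\sigma_U^2\,|C_4(n_1, n_2)| + \frac{1}{3}\,\sigma_U\,|C_3(n_1, n_2)| < \frac{1}{20}\,|C_2(n_1, n_2)|,$$

$$\frac{1}{8}\,n^2\sigma_U^4\,|C_4(n_1, n_2)| + \frac{1}{3}\,n\sigma_U^3\,|C_3(n_1, n_2)| + \frac{1}{2}\,n\sigma_U^2\,|C_2(n_1, n_2)|
< \frac{1}{20}\min\left[\sum_{r=n_1}^{n_2}\binom{n}{r}p^r q^{n-r},\ 1 - \sum_{r=n_1}^{n_2}\binom{n}{r}p^r q^{n-r}\right]$$

are satisfied.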

Next let us consider the statement and properties of the approximate confidence
intervals for $\theta_p$. If the populations are all continuous and technical restriction
(b) is satisfied, then for a very general class of situations

$$y(n_1) < \theta_p < y(n_2+1)$$

is a confidence interval for $\theta_p$ with a confidence coefficient approximately equal to

$$\sum_{r=n_1}^{n_2}\binom{n}{r}p^r q^{n-r} + \frac{n}{4}\,(\sigma_U^2 + \sigma_L^2)\,C_2(n_1, n_2)$$

and a maximum confidence coefficient error of about

$$\frac{n}{4}\,(\sigma_U^2 - \sigma_L^2)\,|C_2(n_1, n_2)|.$$

Here $\sigma_L$ is a known lower bound value for $\sigma$ and is taken as zero if no better value
is available.
Confidence Interval Example. As an illustration of the method, let $n = 14$,
$p = .3$, $\sigma_U = .05$, $\sigma_L = 0$, $n_1 = 2$, $n_2 = 7$. Then

$$\sum_{r=2}^{7}\binom{14}{r}(.3)^r(.7)^{14-r} = .9210, \qquad |C_2(2, 7)| = .1076,$$

$$|C_3(2, 7)| = .1662, \qquad |C_4(2, 7)| = .2179.$$
Then technical restriction (b) is satisfied since

$$.0038 = n\sigma_U^2\,|C_2(n_1, n_2)| < .0790 = \min\left[\sum_{r=n_1}^{n_2}\binom{n}{r}p^r q^{n-r},\ 1 - \sum_{r=n_1}^{n_2}\binom{n}{r}p^r q^{n-r}\right],$$

$$.00372 = \frac{1}{8}\,n\sigma_U^2\,|C_4(n_1, n_2)| + \frac{1}{3}\,\sigma_U\,|C_3(n_1, n_2)| < .00538 = \frac{1}{20}\,|C_2(n_1, n_2)|.$$

Thus, if all the populations are continuous,

$$y(2) < \theta_{.3} < y(8)$$

is a confidence interval for $\theta_{.3}$ with a confidence coefficient of approximately .922
and a maximum confidence coefficient error of about .001.
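The figures in this example can be rechecked numerically. The sketch below assumes that $C_2(n_1, n_2)$ is a second-difference expression in binomial terms of index $n-2$; this form is an assumption of the sketch (the paper's general expression for $C_v$ is not reproduced), chosen because it reduces to the expression for $C_2(0, n_2)$ given in the Verification section and agrees with the quoted $|C_2(2, 7)| = .1076$ to about three decimals. All function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials (0 outside 0..n)."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def c2(n1, n2, n, p):
    """Assumed second-difference form of C_2(n1, n2), binomial index n - 2."""
    def b(k):
        return binom_pmf(k, n - 2, p)
    return (b(n2 - 1) - b(n2)) - (b(n1 - 2) - b(n1 - 1))


def approx_confidence(n, p, n1, n2, sigma_u, sigma_l=0.0):
    """Return (binomial term, C_2, approximate coefficient, maximum error)."""
    first = sum(binom_pmf(r, n, p) for r in range(n1, n2 + 1))
    c = c2(n1, n2, n, p)
    coeff = first + n / 4.0 * (sigma_u ** 2 + sigma_l ** 2) * c
    error = n / 4.0 * (sigma_u ** 2 - sigma_l ** 2) * abs(c)
    return first, c, coeff, error


if __name__ == "__main__":
    first, c, coeff, error = approx_confidence(14, 0.3, 2, 7, 0.05)
    print(round(first, 4), round(c, 4), round(coeff, 3), round(error, 3))
```

The same function can be used to screen candidate pairs $(n_1, n_2)$ before checking technical restriction (b), which additionally needs $C_3$ and $C_4$.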


Finally let us consider the procedure for obtaining a median estimate of $\theta_p$.
This is based on the confidence interval results already presented. Thus it is required
that all the populations are continuous and that technical restriction (b) is satisfied.
Let $n_1 = 0$ and determine the value of $n_2$ which satisfies both

$$\sum_{r=0}^{n_2}\binom{n}{r}p^r q^{n-r} + \frac{n}{4}\,(\sigma_U^2 + \sigma_L^2)\,C_2(0, n_2) = \frac{1}{2} - \epsilon_1,$$

$$\sum_{r=0}^{n_2+1}\binom{n}{r}p^r q^{n-r} + \frac{n}{4}\,(\sigma_U^2 + \sigma_L^2)\,C_2(0, n_2+1) = \frac{1}{2} + \epsilon_2,$$

where $\epsilon_1, \epsilon_2 \ge 0$. Then

$$\frac{\epsilon_2}{\epsilon_1+\epsilon_2}\,y(n_2+1) + \frac{\epsilon_1}{\epsilon_1+\epsilon_2}\,y(n_2+2)$$

is an approximate median estimate for $\theta_p$. Here technical restriction (b) must be
satisfied for both $n_2$ and $n_2+1$. The value for $n_2$ will often be in the vicinity of $(n-1)p$
and this represents a first approximation to the value of $n_2$. The magnitudes of
$C_2(0, n_2)$, $C_3(0, n_2)$, and $C_4(0, n_2)$ are all small for values of $n_2$ near $(n-1)p$. In fact, the
value of $C_2(0, n_2)$ is zero if the integer $n_2$ is equal to $(n-1)p$. Consequently,
technical restriction (b) is nearly always satisfied in the case of median estimation
of $\theta_p$.
Median Estimate Example. To illustrate this estimation method, let $n = 20$,
$p = .2$, $\sigma_U = .1$, $\sigma_L = 0$, and $n_1 = 0$. Then it is found that

$$\sum_{r=0}^{3}\binom{20}{r}(.2)^r(.8)^{20-r} + .05\,C_2(0, 3) = .4086,$$

$$\sum_{r=0}^{4}\binom{20}{r}(.2)^r(.8)^{20-r} + .05\,C_2(0, 4) = .6306.$$

Thus the value determined for $n_2$ is 3 while $\epsilon_1 = .0914$ and $\epsilon_2 = .1306$. Technical
restriction (b) is easily shown to be satisfied for both $n_2$ and $n_2+1$. Hence

$$.59\,y(4) + .41\,y(5)$$

is an approximate median estimate for $\theta_{.2}$.
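The search for $n_2$ and the interpolation weights can be sketched in a few lines. As before, the form used for $C_2(0, n_2)$ is the one stated in the Verification section, written with binomial terms of index $n-2$; the function names are hypothetical, and the sketch does not check technical restriction (b).

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials (0 outside 0..n)."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def approx_lower_prob(n2, n, p, sigma_u, sigma_l=0.0):
    """Binomial sum from 0 to n2 plus the (n/4)(sigma_U^2 + sigma_L^2) C_2(0, n2) term."""
    c2 = binom_pmf(n2 - 1, n - 2, p) - binom_pmf(n2, n - 2, p)
    total = sum(binom_pmf(r, n, p) for r in range(n2 + 1))
    return total + n / 4.0 * (sigma_u ** 2 + sigma_l ** 2) * c2


def median_estimate_weights(n, p, sigma_u, sigma_l=0.0):
    """Return (n2, weight on y(n2+1), weight on y(n2+2))."""
    for n2 in range(n):
        low = approx_lower_prob(n2, n, p, sigma_u, sigma_l)
        high = approx_lower_prob(n2 + 1, n, p, sigma_u, sigma_l)
        if low <= 0.5 <= high:
            eps1, eps2 = 0.5 - low, high - 0.5
            return n2, eps2 / (eps1 + eps2), eps1 / (eps1 + eps2)
    raise ValueError("no n2 brackets the value 1/2")


if __name__ == "__main__":
    n2, w_first, w_second = median_estimate_weights(20, 0.2, 0.1)
    print(n2, round(w_first, 2), round(w_second, 2))
```

Given the ordered observations, the estimate is then the weighted combination of $y(n_2+1)$ and $y(n_2+2)$ with these two weights.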


This median method of estimation combined with the confidence interval
results shows that the generalized percentage point concept presented here satisfies
all the conditions (1)-(5) when restrictions (a) and (b) hold. Significance tests can
easily be obtained on the basis of the confidence interval results. That is, let

$$y(n_1) < \theta_p < y(n_2+1)$$

be a confidence interval for $\theta_p$ with confidence coefficient $\alpha$. Then the following
rule is a significance test for comparing the specified value $\theta_p^{(0)}$ with $\theta_p$ and the
significance level of this test is $1-\alpha$.

Rule: Reject that $\theta_p^{(0)}$ is equal to (or contained in) $\theta_p$, if either $\theta_p^{(0)} < y(n_1)$
or $\theta_p^{(0)} > y(n_2+1)$.
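The rule amounts to checking whether the specified value falls outside the interval formed by two order statistics. A minimal sketch follows; the function name is hypothetical, and the conventions $y(0) = -\infty$ and $y(n+1) = \infty$ are taken from the Definitions and Results section.

```python
def reject_specified_value(theta0, observations, n1, n2):
    """Apply the rule: reject if theta0 < y(n1) or theta0 > y(n2 + 1)."""
    n = len(observations)
    if not 0 <= n1 <= n2 <= n or (n1 == 0 and n2 == n):
        raise ValueError("need 0 <= n1 <= n2 <= n, excluding the pair y(0), y(n+1)")
    # y[0] and y[n + 1] hold the conventional infinite end points
    y = [float("-inf")] + sorted(observations) + [float("inf")]
    return theta0 < y[n1] or theta0 > y[n2 + 1]


if __name__ == "__main__":
    data = [3.1, 0.4, 2.2, 5.9, 1.7, 4.8, 2.9, 3.6, 0.9, 6.3, 1.2, 4.1, 2.5, 5.0]
    print(reject_specified_value(0.5, data, 2, 7))  # compares 0.5 with y(2) and y(8)
```

The significance level attached to the decision is one minus the approximate confidence coefficient of the interval used.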


3. VERIFICATION 


First let us consider the background for technical restriction (b). The
combination of conditions

$$n\sigma_U^2\,|C_2(n_1, n_2)| < \min\left[\sum_{r=n_1}^{n_2}\binom{n}{r}p^r q^{n-r},\ 1 - \sum_{r=n_1}^{n_2}\binom{n}{r}p^r q^{n-r}\right],$$

$$\frac{1}{8}\,n\sigma_U^2\,|C_4(n_1, n_2)| + \frac{1}{3}\,\sigma_U\,|C_3(n_1, n_2)| < \frac{1}{20}\,|C_2(n_1, n_2)|$$

was taken directly from reference [1]. Consider $n$ independent binomial observations
with “success” probabilities of $p_1, \ldots, p_n$ and let $X$ be the observed number of
“successes.” Then, on the basis of reference [1], if this combination of conditions
holds,

$$\Pr(n_1 \le X \le n_2) \doteq \sum_{r=n_1}^{n_2}\binom{n}{r}p^r q^{n-r} + \frac{n}{4}\,(\sigma_U^2 + \sigma_L^2)\,C_2(n_1, n_2)$$

with a maximum error of about

$$\frac{n}{4}\,(\sigma_U^2 - \sigma_L^2)\,|C_2(n_1, n_2)|.$$


Examination of the general expansion for $\Pr(n_1 \le X \le n_2)$ presented in Walsh (1955)
strongly suggests that this approximate probability relation also holds if these conditions
are replaced by the requirement

$$\frac{1}{8}\,n^2\sigma_U^4\,|C_4(n_1, n_2)| + \frac{1}{3}\,n\sigma_U^3\,|C_3(n_1, n_2)| + \frac{1}{2}\,n\sigma_U^2\,|C_2(n_1, n_2)|
< \frac{1}{20}\min\left[\sum_{r=n_1}^{n_2}\binom{n}{r}p^r q^{n-r},\ 1 - \sum_{r=n_1}^{n_2}\binom{n}{r}p^r q^{n-r}\right].$$

This condition asserts that the sum of the second, third and fourth terms of the
$\Pr(n_1 \le X \le n_2)$ expansion amounts to at most 5 percent of both the first term and
the probability complement of the first term. On the basis of the considerations of
reference [1], this strongly indicates that the third and higher order terms of the
expansion can be neglected. To obtain mathematical rigor, the allowable populations
are limited to those for which the approximate probability properties stated for
$\Pr(n_1 \le X \le n_2)$ are satisfied. However, the important practical consideration is
that these properties appear to be satisfied for virtually all situations of interest.

Next let us consider the confidence interval implications of technical restriction
(b). Since the statistical populations are continuous and the observations are
independent,

$$\Pr[y(n_1) < \theta_p < y(n_2+1)] = \Pr(n_1 \le X \le n_2)$$

and the stated confidence interval properties are verified.
Finally let us consider proof that $C_2(0, n_2) = 0$ when $n_2 = (n-1)p$. By
definition,

$$C_2(0, n_2) = \binom{n-2}{n_2-1}p^{n_2-1}q^{n-n_2-1} - \binom{n-2}{n_2}p^{n_2}q^{n-n_2-2}$$

$$= \frac{(n-2)!\;p^{n_2-1}q^{n-n_2-2}}{n_2!\,(n-n_2-1)!}\,[\,n_2 - (n-1)p\,],$$

justifying the stated relation. Similar analysis shows that $C_3(0, n_2)$ and $C_4(0, n_2)$ tend
to have small magnitudes when $n_2 = (n-1)p$.
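The identity can be confirmed with exact rational arithmetic. The sketch below compares the two-term definition with the factored form for $1 \le n_2 \le n-2$ and checks the vanishing case; the function names and the particular values $n = 11$, $p = 3/10$ are illustrative choices only.

```python
from fractions import Fraction
from math import comb, factorial


def c2_by_definition(n2, n, p):
    """Two-term form of C_2(0, n2) built from binomial terms of index n - 2."""
    p = Fraction(p)
    q = 1 - p

    def b(k):
        if k < 0 or k > n - 2:
            return Fraction(0)
        return comb(n - 2, k) * p ** k * q ** (n - 2 - k)

    return b(n2 - 1) - b(n2)


def c2_factored(n2, n, p):
    """Factored form with the bracket [n2 - (n - 1) p]; intended for 1 <= n2 <= n - 2."""
    p = Fraction(p)
    q = 1 - p
    coefficient = Fraction(factorial(n - 2), factorial(n2) * factorial(n - n2 - 1))
    return coefficient * p ** (n2 - 1) * q ** (n - n2 - 2) * (n2 - (n - 1) * p)


if __name__ == "__main__":
    n, p = 11, Fraction(3, 10)  # (n - 1) p = 3 is an integer
    for n2 in range(1, n - 1):
        print(n2, c2_by_definition(n2, n, p), c2_factored(n2, n, p))
```

With these values the expression is exactly zero at $n_2 = 3$ and changes sign there, which is why $n_2$ near $(n-1)p$ makes the correction term small.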
JOINT ASYMPTOTIC DISTRIBUTION OF U-STATISTICS
AND ORDER STATISTICS 


J. SETHURAMAN 
Indian Statistical Institute, Calcutta 
and 
B. V. SUKHATME 
Indian Council of Agricultural Research 


SUMMARY. It is shown under some mild restrictions that the joint distribution of a U-statistic
(Hoeffding) and the $a_n$-th order statistic tends to (i) the bivariate normal distribution if $a_n/n \to p$, $0 < p < 1$,
(ii) the joint distribution of two independent variables, one of which is gamma and the other normal, in
case $a_n \to$ constant or $n - a_n \to$ constant, (iii) the joint distribution of two independent normal variables
if $a_n \to \infty$ such that $a_n/n \to 0$ or $(n - a_n)/n \to 0$. The above results are generalised to the case of several order
statistics and several U-statistics. The generalisation to the case of several populations and generalised
(Lehmann) U-statistics is also pointed out.


1. INTRODUCTION 


Let $x_1, \ldots, x_n$ be $n$ independent observations on a random variable $X$.
Writing down the observations in increasing order we get $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$.
If $\{a_n\}$ is a sequence of integers satisfying $1 \le a_n \le n$, then by the $a_n$-th order statistic
we mean $x_{(a_n)}$. The random variable corresponding to it is $X_{(a_n)}$. We are interested
in the joint asymptotic distribution of $X_{(a_n)}$ and any U-statistic. Sukhatme
(1957) has shown that the joint asymptotic distribution of the sample median $X_{(m)}$ and a
U-statistic with a bounded kernel, from a distribution whose density function is
continuous at the median, is bivariate normal. We now proceed to prove the
results stated in the summary.


2. THE CASE $a_n \to \infty$; $a_n/n \to p$, $0 < p < 1$


Theorem 1: Let $x_1, \ldots, x_n$ be $n$ independent observations on a random
variable $X$ with distribution function $F(x)$ and density function $f(x)$. Let $f(x)$ be continuous
at $\theta$, the $p$-th quantile of the population. Let $y$ be the $a_n$-th order statistic from the sample
where $a_n/n \to p$, $0 < p < 1$. Let $Y$ be the random variable corresponding to $y$. Let
$U_n$ be defined as a U-statistic from the bounded kernel

$$\psi(w_1, \ldots, w_k). \qquad \ldots\ (2.1)$$

Let $E(\psi(X_1, \ldots, X_k)) = m$.
Then the joint distribution of

$$\{\xi = \sqrt{n}\,(Y - \theta),\ \ \eta = \sqrt{n}\,(U_n - m)\}$$

tends to the bivariate normal.


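Proof: Let

$$\varphi(x) = E(\psi(x, X_2, \ldots, X_k)), \qquad \ldots\ (2.2)$$

$$E(\varphi(X) - m)^2 = \sigma^2, \qquad \ldots\ (2.3)$$

$$\int_{-\infty}^{\theta}(\varphi(w) - m)\,f(w)\,dw = m', \qquad \int_{\theta}^{\infty}(\varphi(w) - m)\,f(w)\,dw = m'',$$

$$\int_{-\infty}^{\theta}(\varphi(w) - m)^2 f(w)\,dw = \sigma_1^2, \qquad \int_{\theta}^{\infty}(\varphi(w) - m)^2 f(w)\,dw = \sigma_2^2. \qquad \ldots\ (2.4)$$

Then we have

$$m' + m'' = 0, \qquad \sigma_1^2 + \sigma_2^2 = \sigma^2. \qquad \ldots\ (2.5)$$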
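The characteristic function $\phi_n(t_1, t_2)$ of $\{\xi,\ \eta' = \frac{1}{\sqrt n}\sum_{i=1}^{n}(\varphi(X_i) - m)\}$
is given by

$$\phi_n(t_1, t_2) = E(\exp(it_1\xi + it_2\eta')) = E_\xi[E(\exp(it_1\xi + it_2\eta') \mid \xi)]. \qquad \ldots\ (2.6)$$

Then proceeding as in Sukhatme (1957) it turns out that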
= т! e е, . — 
Pnlty te) = Мал n exp (it E+ it, E ) fü)dyx © 
у (w)— 1 © 
е Ф = 
р "n (s rm ie] | | exp ( it, PO) шуды]. 
(2.7) 
Putting w = 0 + Tz we find as in Sukhatme (1957) that 
Га it. > ) f(w)dw = p+ it LES o? A i 
je (is Л “Уй 52 dE cR ELS 
. (2.8) 
| exp (i. MI) шй» = qit E og AÀ ait 


р ——— i8. „ү 1 
; т n 2n Vn я o(—) 
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where À 


| 


Г (0 He у) 


1 


д 


| 
£ p Я [ ... (2.9) 
кин | 


We also note that for a fixed $\xi$, $\lambda \to f(\theta)$ as $n \to \infty$ since $f(w)$ is continuous at $\theta$.
Using (2.8) and simplifying (2.7) after expansion


vel HE (i HK") шге | x B exp (ir =") rege ] 4 

à эй, pm" mh (m m” 
= const — $ o J+ Seah —=)+(®—+"—)+оп) (в) 

d 
© elo ==.) —т М 
Thus — e,(f, 15) = const X | ехр [i E+ it, y — са + 

$ "о it " u A E 

+8 (+ m) Pile (MT) — „+ о@]/(#+)& e enm 


Now letting $n \to \infty$, and taking the lim sign inside the integral, which is valid in virtue
of the bounded convergence theorem, we find that the right hand side of (2.11) without
the constant tends to


е 2(0) , 2% m' m _ 6 m mi 
ro [ exp [щ&— OF + 79 8 (5 mH ott ("= + mh Nae 


t 2 tito = ET 2а) pq f Pq ] 
2 


5 ' 
= ЛӨ 9XP Ба z p) FO 20) 


and the constant оп the right hand side of (2.11), namely 


1 Eu 
V ?mpq ` 
7 А Е PY 
[ ехр (= Бат. o(1)) me v) 
ü ont, m_m) ра ü pq. 
Thus plti (5)? #0 ta) = exp | - ao d | 1 A 10) 3 70) | (2.12) 
Thus, we see that the asymptotic distribution of $(\xi, \eta')$ is bivariate normal. From
Hoeffding (1948) we see that $E(\eta - \eta')^2 \to 0$ as $n \to \infty$, so that $\eta$ and $\eta'$ are asymptotically
equivalent. Thus the asymptotic distribution of $(\xi, \eta)$ is bivariate normal


with zero means and asymptotic variances 


S 5 and Ро? the correlation coefficient 


P 


being (= = = у (2.13) 
q pl c 

The above theorem was proved under the condition that $\psi(w_1, \ldots, w_k)$ is
bounded. It is easy to show that the theorem holds good provided $\psi(w_1, \ldots, w_k)$ is
bounded on any bounded interval of $(w_1, \ldots, w_k)$ and

$$E\,|\psi(X_1, \ldots, X_k)|^3 < \infty. \qquad \ldots\ (2.14)$$

This condition is sometimes more useful in practice.

The above theorems can be easily extended to the case of several U-statistics
each of which satisfies one of the conditions (2.1) or (2.14). Again, the result
can also be extended to the case of several order statistics. The last extension is
quite straightforward, but the proof involves heavy algebra and hence will not be
given here.
Another kind of extension of the above results is as follows:

Let $(x_1, \ldots, x_{n_1})$ and $(y_1, \ldots, y_{n_2})$ be independent observations on two independent
random variables $X$ and $Y$ respectively. Let $\{a_n\}$, $\{b_n\}$ be two sequences of integers
and let $Z_1$, $Z_2$ be the random variables corresponding to $x_{(a_{n_1})}$ and $y_{(b_{n_2})}$. Let $a_{n_1}/n_1 \to p_1$,
$b_{n_2}/n_2 \to p_2$, $0 < p_1, p_2 < 1$. Let $\theta_1$ and $\theta_2$ be the $p_1$-th, $p_2$-th quantiles of $X$ and $Y$
respectively.

Let $U_{n_1, n_2}$ be a generalised (Lehmann) U-statistic with kernel
$\psi(x_1, \ldots, x_{k_1};\ y_1, \ldots, y_{k_2})$ which is either bounded or bounded in any bounded interval of its
arguments and possesses a third moment. If $E(\psi) = m$ then the joint distribution of

$$\{\sqrt{n_1}\,(Z_1 - \theta_1),\ \ \sqrt{n_2}\,(Z_2 - \theta_2),\ \ \sqrt{n_1}\,(U_{n_1, n_2} - m)\}$$

tends to the trivariate normal distribution as $n_1, n_2 \to \infty$ such that $n_1/n_2 \to c$, $0 < c < \infty$.


The proof depends on the fact that 2 


- 1 = == Ny 
Un, та d Xi V td 
iie ed چ‎ UD MR Ea 


are asymptotically equivalent (see Fraser, 1957) where y 
ylz) = Е((жу, Xo. Хы; Ү,,..., У,)—т 
and Wl) = EX, =.. Ха; Yi Wes wey, Ув) =. 
3. THE CASE $a_n \to s$ OR $n - a_n \to (r-1)$ WHERE $s$ AND $r$ ARE CONSTANTS

Since $\{a_n\}$ is a sequence of integers, it is obvious that $a_n \to s$ and
$n - a_n \to (r-1)$ imply that after a certain stage $a_n = s$ and $n - a_n = (r-1)$
respectively, so that we may take them to be $s$ and $(r-1)$ respectively for all $n$.
Thus we will have to find the joint asymptotic distribution of the $s$-th and
$(n-r+1)$-th order statistics and U-statistic.

Theorem 2: Let $x_1, \ldots, x_n$ be $n$ independent observations made on a random
variable $X$ with a distribution function $F(x)$ which is continuous. Let $y$ and $z$ be the $s$-th
and $(n-r+1)$-th order statistics of the sample. Let $U_n$ be a U-statistic generated out of
a bounded symmetric kernel $\psi(w_1, \ldots, w_k)$.

Let $E(\psi(X_1, \ldots, X_k)) = m$,

$$\xi = nF(Y), \qquad \eta = n(1 - F(Z)), \qquad \zeta = \sqrt{n}\,(U_n - m). \qquad \ldots\ (3.1)$$

Then, in the asymptotic distribution of $(\xi, \eta, \zeta)$ the variables are independent and the
marginal distributions are gamma, gamma and normal, respectively.
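A small simulation can make the statement of Theorem 2 tangible. The sketch below is only an illustration under stated assumptions: the population is rectangular on (0, 1), so that $F(x) = x$; the U-statistic is the sample mean (kernel $\psi(w) = w$, $m = 1/2$); and the limiting gamma distributions are taken to have shape parameters $s$ and $r$ with unit scale, hence means $s$ and $r$. The function name, sample sizes and seed are arbitrary choices.

```python
import random
from statistics import fmean, pvariance


def simulate_extremes(n=200, s=2, r=3, reps=2000, seed=12345):
    """Return (mean of xi, mean of eta, variance of zeta) over simulated samples."""
    rng = random.Random(seed)
    xi, eta, zeta = [], [], []
    for _ in range(reps):
        u = sorted(rng.random() for _ in range(n))
        xi.append(n * u[s - 1])                     # n F(Y), Y the s-th order statistic
        eta.append(n * (1.0 - u[n - r]))            # n (1 - F(Z)), Z the (n-r+1)-th
        zeta.append(n ** 0.5 * (fmean(u) - 0.5))    # sqrt(n) (U_n - m) for the sample mean
    return fmean(xi), fmean(eta), pvariance(zeta)


if __name__ == "__main__":
    mean_xi, mean_eta, var_zeta = simulate_extremes()
    print(round(mean_xi, 2), round(mean_eta, 2), round(var_zeta, 3))
```

With these settings the two means should lie near $s$ and $r$ and the variance of $\zeta$ near $1/12$, the variance of a rectangular variable; checking independence would require looking at sample correlations as well.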
Proof: It is obvious that we need prove the theorem for the case of the
rectangular distribution only.

Let

$$\varphi(x) = E(\psi(x, X_2, \ldots, X_k)), \qquad E(\varphi(X) - m)^2 = \sigma^2, \qquad \ldots\ (3.2)$$

$$\zeta' = \frac{1}{\sqrt n}\sum_{i=1}^{n}(\varphi(X_i) - m).$$

acteristic function of (E, 9, €) is given by 


Erit its = 2 [E( exp (ihét itant itak" )1E, vy). ... (3.3) 
(im 


Then the char 


ФВ, ty, ty) = Е exp (it 


Arguing as before 


Palto to ts) 


ы рр ер [аенын (т rc] 


И ШЕ. 
(8) —r—3)e UTE 


si В — 
x [e (aem ) dw ] x [ I exp (i, ) dw | x 
0 / 
x| [ exp (it хуа ) dw Em. ... (3.4) 
1—7/n 
It is easily seen that


Eln 


| ехр( ails Cn ) dw =% +o (—) 
1—"7[n 2 
(a (zg (fee (1) poem 


so that 


d 
n! . . | e|2]—m 
фФи(, tests) = уа вур) т f | exp | esee ( G de 


O<E+I< 


« =! rs 8-1 в ERU ҮТ 
| x[E+o(1)? x [1—55 ва—5—7+-о(}) | х[--о(1)]'-14®@. 


(3.6) 
Now letting $n \to \infty$, and taking the lim sign inside the integral, which is valid in virtue
of the bounded convergence theorem, we have


^ s 
t 


Jim. e, (i, fy ta) = const X | exp | in itj 3 0-Е | yf Addy 


=—8 


T 1 3 ga 


= const X eX 2 
aca tage „= BT) 


where again, we can easily see that the constant is unity and hence the theorem is
proved. Extensions to the case of several U-statistics and generalised U-statistics
are obvious. Incidentally we note that the $s$-th order statistic and the $(n-r+1)$-th
order statistic are independent in the limit, if $r$ and $s$ are constants.


4. THE CASE $a_n \to \infty$; $a_n/n \to 0$ OR $(n - a_n)/n \to 0$


Theorem 3: Let 2%,...,%, be m independent А 
Г : observat 
variable X with distribution function F(x) which is continuous p on = pe 
sequences of integers such that {а„}, {bn} be two 
b 
5 € n and a,—«0, رکا‎ Ay 0, 2^ с 

1< а, < % < n n n "b a, < Ка constant, ... (41) 

where €, = n— by 
Let $U_n$ be any U-statistic generated by a bounded kernel $\psi(w_1, \ldots, w_k)$ and let

$$\varphi(x) = E\,\psi(x, X_2, \ldots, X_k), \qquad E\,\varphi(X) = m, \qquad E(\varphi(X) - m)^2 = \sigma^2.$$


Then the asymptotic distribution of (5, 1, 6) where 


ge = (F(x 
б = A/n(U,—m) 


Fe a : . —#(: à 
is given by the density function, constant Xe ( TS ud 


Proof: It is obvious that we need prove the theorem for the rectangular
case only. If $\zeta' = \frac{1}{\sqrt n}\sum_{i=1}^{n}(\varphi(X_i) - m)$ then $\phi_n(t_1, t_2, t_3)$, the characteristic function
of $(\xi, \eta, \zeta')$, is given by
ef ta, ty) = EC exp (Filan its) 
= Be, „(Е exp (HE +a Fits), 1]. e (4.4) 


Proceeding on the same lines as in the previous cases 


(hs tay ty) n! D Mas, 
Pallas los 13) = Ta, — p änn t l)en)! n? 


(жуе 3 Е) т o(1—& — V a) 
ВЕНЫ | - | г. 
TIEI it, EF Чы] | t | = x 
0< E 
—1 
- э 
4 ọ(w)—m 
exp ( [A a) dw x 
е vn n—An—Ca +1 
Cn У ту 
4 9(w)— Ji dib x 
exp (its As 
m = 
Cn 
: w)—m Л 
ехр ( tts oe ) dw dédy а (4.5) 
р га Sha 
which on simplifying, as done in previous paras, reduces to


A - 
e dn V n Е )—m 
n n 


Palli» ta, в) = const X | | exp 309 Hita 


Vn 
Nant + Уст <N—An—Cn 
С, мс 
1—8 n VES 
E ms i) ве т 
| на 805—5 нош | йй. ... (4.6) 


Letting тоо, (the limits for Е, 1 become (0, oo), (0, со) because а < К), the above 


integral without the constant tends to 
oo wo n р 2 
_ В „ү . ER S 0 Y {2 i i 
| | ехр | aad iE ilt] — 5 7] d£dy— 2n exp| -52 80 a (409) 


and the constant which is 


1 1 
P exp(—2)2—0]2-Fo(l)ydédy 2m 


Jang М бт тап 


Thus [UM la; جوا‎ exp [ d — f 2 
9 $9 v^ | (4.8) 


and hence the theorem is proved.

The extensions to several U-statistics and generalised U-statistics are immediate.
We note that the $a_n$-th and $b_n$-th order statistics when suitably normalised
are asymptotically normally and independently distributed if condition (4.1)
holds.

In section 3 and section 4 we have shown that $U_n$ and $F(Y)$, where $Y$ is the
$a_n$-th order statistic, have a certain asymptotic distribution. To make a similar


statement about U, and Y we need the following, (in case 50). 


F(a) has a density function f(x) and lim ; 
Ew exists and is not equal to 


zero. For an example where this occurs, we may cite the exponential
distribution.
5. REMARKS¹

In the above proofs the use of characteristic functions has partly
blurred out the picture of the exact process by which the limit distributions were
attained. To gain some insight into this aspect we reason as follows: Suppose
we fix the $a_n$-th order statistic, i.e., the normalised variable corresponding to it.
If $\eta' = \frac{1}{\sqrt n}\sum_{i=1}^{n}(\varphi(X_i) - m)$ we see that it splits up into 3 parts. One part being the
sum of $(a_n - 1)$ independent random variables where $x$ ranges on $(-\infty, x_{(a_n)})$, another
part being the sum of $(n - a_n)$ independent random variables where $x$ ranges on $(x_{(a_n)}, \infty)$
and a third part fixed at $(\varphi(x_{(a_n)}) - m)/\sqrt n$. From an application of the Central Limit
Theorem we see that the limit of the conditional distribution of $\eta'$ for a fixed $\xi$ is
normal. It can also be shown, after some algebra, that the mean and variance of
this distribution are


(—f(0) č ( m" m ) ,0)— ( im + A )) in the case 2 
| $ 


q DP я 4 
$(0, \sigma^2)$ in the case 3
$(0, \sigma^2)$ in the case 4

and in each of these cases we know the limiting marginal distribution of $\xi$ so that
the nature of the limiting joint distribution can be concluded to be bivariate normal
in case 2, the distribution of independent normal and gamma variables in case 3,
and the distribution of two independent normal variables in case 4. The conclusion
can be justified using the following

Lemma: Let $(X_n, Y_n)$ be a sequence of random variables. Let $F_n(y/x)$, the
conditional distribution of $Y_n$ given $X_n = x$, tend weakly to a distribution function $F(y/x)$.
Let $G_n(x)$, the marginal distribution of $X_n$, tend weakly to a distribution function
$G(x)$. Then under some conditions $F_n(x, y)$, the joint distribution of $(X_n, Y_n)$, tends
to the distribution $\int_{-\infty}^{x} F(y/u)\,dG(u)$.

A set of sufficient conditions are (1) $F(y/x)$ is continuous in $y$ for each $x$,
(2) $G_n(x)$, $G(x)$ admit of probability densities $g_n(x)$, $g(x)$ respectively, and $g_n(x) \to g(x)$
uniformly in any bounded interval of $x$. Both, that these conditions are sufficient
and that these conditions are satisfied in our case, can be easily verified.

¹ We are grateful to Dr. S. K. Mitra of the Indian Statistical Institute for these remarks.



Vol. 21] SANKHYĀ : THE INDIAN JOURNAL OF STATISTICS [Parts 3 & 4


6. ACKNOWLEDGEMENTS 


The authors wish to thank Dr. D. Basu of the Indian Statistical Institute 
for the several suggestions and helpful discussions during the course of the work 
presented here. 


REFERENCES 


Fraser, D. A. S. (1957): Non-parametric Methods in Statistics, 257. John Wiley & Sons, New York.

Hoeffding, W. (1948): A class of statistics with asymptotically normal distribution. Ann. Math. Stat., 19, 293-325.

Sukhatme, B. V. (1957): Joint asymptotic distribution of the median and a U-statistic. J. Roy. Stat. Soc., Series B, 19, 144-148.


Paper received : June, 1958. 
Revised : December, 1958. 




SOME SAMPLING SYSTEMS PROVIDING UNBIASED 
RATIO ESTIMATORS 


By N. S. NANJAMMA, M. N. MURTHY, and V. K. SETHI
Indian Statistical Institute, Calcutta 
SUMMARY. In this paper, modifications of many of the selection procedures commonly adopted in practice, namely, equal probability sampling, varying probability sampling, stratified sampling and multi-stage sampling have been proposed, which, while retaining the form of the usual ratio estimators, make them unbiased. For many of the situations commonly met with in practice, this modification of a given sampling scheme consists essentially in first selecting one unit with probability proportional to its value of the characteristic occurring in the denominator of the ratio and then the remaining units in the sample according to the original scheme of sampling. The expressions for unbiased variance estimators of the unbiased ratio estimators have been given for some of the more important sampling schemes. Further the selection and estimation procedures which provide unbiased ratio estimators in the case of a certain general class of population parameters together with the expressions for its sampling variance and variance estimators have also been considered.

1. INTRODUCTION 


As the relationship between two characteristics is usually of much interest, estimation of ratios of certain population parameters has become quite important in a large number of surveys. The method of ratio estimation is also being used to estimate population totals, since a ratio estimator is more efficient than the conventional unbiased estimator under certain circumstances not uncommon in actual practice.

The usual procedure of using the ratio method in estimating any population ratio or total has been to take the ratio of unbiased estimators of the numerator and the denominator and in the latter case multiply it by the population total of the supplementary variate taken in the denominator. A disadvantage of this method is that the estimator so obtained is biased for many of the selection procedures commonly adopted in surveys. Further a completely satisfactory (at least to the present authors) treatment of the errors and biases of a ratio estimator is not yet available. For small samples, at least, the bias is not likely to be small.

In recent years attempts have been made to give selection and estimation procedures which provide unbiased ratio estimators. Lahiri (1951) has given a method of selecting a sample with probability proportional to its total size (pps) (sum of the sizes of the units in the sample) which is essentially similar to his method of selecting a unit with pps, namely, of selecting a unit with equal probability and including that unit in the sample if a number chosen at random from one to an upper bound of the sizes of the units is less than or equal to the size of the selected unit. By 'size' here is meant the value of the supplementary variate under consideration. Obviously this method avoids the need for completely enumerating all possible samples and finding their total sizes and the cumulated sizes. Once a sample is chosen with pps it is easy to obtain an unbiased ratio estimator. The disadvantage of the selection procedure given by Lahiri is that it involves rejection of some draws.

Midzuno (1952) and Sen (1952) have independently given a simple procedure for obtaining a sample with pps. Their method consists in selecting one unit with pps and the rest with equal probability without replacement from the remaining units of the population. It may be observed that Lahiri's method of selecting one unit with pps could profitably be used in the selection procedure given by Midzuno and Sen.
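
The two selection procedures just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `lahiri_select` and `midzuno_sen_sample` are hypothetical, and the random number in Lahiri's method is drawn from a continuous uniform distribution on (0, upper bound) rather than from the integers 1 to the upper bound, so that non-integer sizes are also handled.

```python
import random


def lahiri_select(sizes, upper_bound, rng):
    """Select one unit index with probability proportional to size.

    Rejection method: pick a unit with equal probability and accept it
    if a random number drawn up to the upper bound does not exceed its size.
    Assumes 0 < sizes[i] <= upper_bound for every unit.
    """
    n_units = len(sizes)
    while True:
        i = rng.randrange(n_units)           # unit chosen with equal probability
        r = rng.uniform(0.0, upper_bound)    # assumption: continuous uniform draw
        if r <= sizes[i]:
            return i                         # accepted; otherwise the draw is rejected


def midzuno_sen_sample(sizes, n, rng):
    """One unit with pps (via Lahiri's method), the other n-1 units with
    equal probability without replacement from the remaining units."""
    first = lahiri_select(sizes, max(sizes), rng)
    remaining = [i for i in range(len(sizes)) if i != first]
    return sorted([first] + rng.sample(remaining, n - 1))
```

A call such as `midzuno_sen_sample([12, 5, 30, 8, 20], 3, random.Random(1))` returns three distinct unit indices. The upper bound need not be the exact maximum; a looser bound only increases the number of rejected draws.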


In the case of stratified sampling Lahiri has pointed out that his method could be applied to select a sample with probability proportional to $\sum_{s=1}^{k} N_s \bar{x}_s$, where k is the number of strata, $N_s$ the number of units in the s-th stratum and $\bar{x}_s$ the s-th stratum sample mean of the supplementary variate under consideration, with a view to get an unbiased ratio estimator. Des Raj (1954) has given the expressions for the variance and an unbiased variance estimator of the ratio estimator in the case of a multi-stage design where the sample of first stage units is selected with pps.


So far the selection procedures providing unbiased ratio estimators have been given only for simple designs and that too for a very restricted class of parameters. In the next few sections, modifications of many of the selection procedures commonly adopted in practice, namely, equal probability sampling, varying probability sampling, stratified sampling and multi-stage sampling, have been given which, while retaining the form of the usual biased ratio estimators, make them unbiased. The expressions for the unbiased variance estimator for some of the more important cases of ratio estimators are also given. Further the selection and estimation procedures for obtaining unbiased ratio estimators in the case of a certain general class of population parameters together with the expressions for the sampling variance and the variance estimator are given in the last few sections.


For many of the situations commonly met with in practice, the modification of a given sampling scheme referred to above which provides an unbiased ratio estimator consists essentially in first selecting one unit with probability proportional to its value of the variate occurring in the denominator of the ratio and then the remaining units in the sample according to the original scheme of sampling. For many of the sampling schemes considered in this paper it might be expected that in large samples the bias of the conventional ratio estimator is likely to be small, since the form of the ratio estimator is the same in the case of the biased and the unbiased ratio estimators and the sample based on the original sampling scheme and that on the modified scheme could be made the same but for a difference of one unit at the most.


In this paper the estimator and its variance estimator have been given in the case of estimating the ratio R = Y/X, where Y and X are the population totals for two characters. The ratio estimator and its variance estimator for estimating Y can be obtained by multiplying the corresponding estimators in the case of estimation of R by X and X² respectively.


2. EQUAL PROBABILITY SAMPLING


In the case of unstratified unistage sampling with equal probability without replacement, as has been mentioned earlier, Midzuno and Sen have given a method of selecting one unit with probability proportional to x (ppx), where x is the value of the variate occurring in the denominator of the ratio, and the remaining n−1 units in the sample from the remaining N−1 units in the population with equal probability without replacement. The probability of getting a particular sample s by this approach is given by

$$P(s) = \frac{1}{\binom{N}{n}}\,\frac{\bar{x}}{\bar{X}} \tag{2.1}$$

where $\bar{x}$ and $\bar{X}$ are the sample and the population means respectively. Hence the estimator

$$\hat{R} = \frac{\bar{y}}{\bar{x}} \tag{2.2}$$

where $\bar{y}$ is the sample mean of the variate y, is an unbiased estimator of the ratio R = Y/X. Though this estimator resembles the usual ratio estimator in the case of equal probability sampling without replacement, this is unbiased while the latter is not. An unbiased estimator of the variance of $\hat{R}$ given in (2.2) is given by

$$v(\hat{R}) = \hat{R}^2 - \frac{1}{N n \bar{x} \bar{X}}\left[\sum_{i=1}^{n} y_i^2 + \frac{2(N-1)}{n-1}\sum_{i<j} y_i y_j\right] \tag{2.3}$$
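
The unbiasedness claimed for the ratio of sample means under this scheme can be checked by complete enumeration on a small population. The sketch below assumes the selection probability $P(s) = \bar{x} / (\bar{X}\binom{N}{n})$ and uses exact rational arithmetic; the function names and the data in the usage note are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def midzuno_prob(sample, x, n):
    """Probability of an unordered sample: xbar / (Xbar * C(N, n))."""
    N = len(x)
    xbar = Fraction(sum(x[i] for i in sample), n)
    Xbar = Fraction(sum(x), N)
    return xbar / (Xbar * comb(N, n))


def expected_ratio(x, y, n):
    """Return (sum of P(s), E[ybar/xbar]) over all samples of size n."""
    total_p = Fraction(0)
    expectation = Fraction(0)
    for s in combinations(range(len(x)), n):
        p = midzuno_prob(s, x, n)
        r_hat = Fraction(sum(y[i] for i in s), sum(x[i] for i in s))
        total_p += p
        expectation += p * r_hat
    return total_p, expectation
```

With integer data such as `x = [2, 3, 5, 7, 11]` and `y = [1, 4, 6, 8, 15]`, `expected_ratio(x, y, 3)` returns probabilities summing to one and an expectation equal to `Fraction(sum(y), sum(x))`, that is, Y/X.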


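It can be seen that the efficiency of the unbiased estimator given in (2.2) will be greater than, equal to, or less than that of the corresponding biased estimator according as

$$\operatorname{Cov}\{(\hat{R} - R)^2,\ \bar{x}\} \gtreqless 0, \tag{2.4}$$

the covariance being taken over equal probability sampling without replacement.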


The modification of the procedure of sampling with equal probability with replacement which provides an unbiased ratio estimator would be to select one unit with ppx, replace it and then select the rest of the (n−1) units from the whole population with equal probability with replacement at each draw. With this selection procedure the ratio estimator given in (2.2) is unbiased for estimating the population ratio R, since the probability of getting a particular sample in this case is

$$P(s) = \frac{n!}{\prod_{i=1}^{\nu} \lambda_i!}\,\frac{1}{N^n}\,\frac{\bar{x}}{\bar{X}} \tag{2.5}$$


where $\lambda_i$ is the number of repetitions of the i-th unit and $\nu$ is the number of distinct units in the sample. The sampling variance and an unbiased variance estimator of the ratio estimator in this case would be different from those in the case of sampling with equal probability without replacement. The variance estimator is given by

$$v(\hat{R}) = \hat{R}^2 - \frac{\sum_{i} \lambda_i(\lambda_i - 1)\, y_i^2 + 2\sum_{i>j} \lambda_i \lambda_j\, y_i y_j}{n(n-1)\,\bar{X}\bar{x}} \tag{2.6}$$


In the case of sampling with equal probability systematically, an unbiased ratio estimator could be obtained by considering each unit as made up of n sub-units with each sub-unit of the i-th unit having the size $x_i/n$ and selecting one sub-unit with ppx and the other sub-units in the sample systematically proceeding cyclically with the sub-unit selected first as the random start and N as the sampling interval. The probability of getting a particular sample s is given by

$$P(s) = \frac{1}{N}\,\frac{\bar{x}}{\bar{X}} \tag{2.7}$$

With this probability scheme the estimator given in (2.2) is an unbiased estimator of R. The variance of the ratio estimator in this case is different from those of estimators based on equal probability selection with or without replacement. Since the selection is being done systematically in this sampling scheme, it would not be possible to get an unbiased variance estimator of the unbiased ratio estimator from a single sample.


It is to be noted that even if the values of the variate x coming in the denominator are not known for all the units in the population at the time of selection, it is possible to select one unit with ppx by adopting Lahiri's method provided an upper bound of the values of x which is not much greater than the maximum value is known. The population total of the variate x would be necessary only if the population total Y is to be estimated using x as the supplementary variate.


3. VARYING PROBABILITY SAMPLING 
The ratio of an unbiased estimator of Y to that of X based on pps sampling scheme, the size being the value of a variate z related to both the characteristics x and y, is known to be biased for the population ratio R = Y/X. In this section are given the selection procedures which provide unbiased ratio estimators corresponding to the usual biased ratio estimators in the case of sampling with pps with replacement, pps without replacement and pps systematically.


The modification of the pps with replacement scheme consists in selecting first one unit with ppx, replacing it and then selecting the rest of the (n−1) units from the whole population with ppz with replacement. The probability of getting a particular sample s by this procedure is given by

$$P(s) = \frac{n!}{\prod_{i=1}^{\nu} \lambda_i!}\left(\prod_{i=1}^{\nu} p_i^{\lambda_i}\right)\frac{1}{nX}\sum_{i=1}^{\nu} \lambda_i \frac{x_i}{p_i} \tag{3.1}$$

where $\lambda_i$ and $\nu$ are as in (2.5) and $p_i = z_i/Z$ where $Z = \sum_{i=1}^{N} z_i$. It may be verified that in this case an unbiased estimator of R is given by

$$\hat{R} = \frac{\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{y_i}{p_i}}{\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{x_i}{p_i}} \tag{3.2}$$

It may be noted that the expression for the unbiased ratio estimator is the same as that of the usual biased ratio estimator in the case of pps with replacement sampling. For in the latter case $\frac{1}{n}\sum_{i=1}^{n} y_i/p_i$ and $\frac{1}{n}\sum_{i=1}^{n} x_i/p_i$ are unbiased estimators of Y and X respectively. Of course, the variance and variance estimator would be different in the two cases.

If in the above selection procedure the units sampled are not replaced before the next and subsequent draws, we get the modified pps without replacement scheme which provides an unbiased ratio estimator. But in practice, it is difficult to compute the estimate as the computations involved are quite heavy except in some special cases. Two such special cases have been considered to illustrate the method.

(i) ppx and ppz of the remaining (n = 2). The modification of this procedure consists in selecting first one unit with ppx and another unit from the remaining (N−1) units with ppz (z being a size other than x). The probability of getting a particular sample s $(u_1, u_2)$ is given by

$$P(s) = \frac{x_1}{X}\cdot\frac{p_2}{1-p_1} + \frac{x_2}{X}\cdot\frac{p_1}{1-p_2} \tag{3.3}$$

It can be seen that this procedure provides the following unbiased estimator of the ratio R.

$$\hat{R} = \frac{\dfrac{y_1}{p_1}(1-p_2) + \dfrac{y_2}{p_2}(1-p_1)}{\dfrac{x_1}{p_1}(1-p_2) + \dfrac{x_2}{p_2}(1-p_1)} \tag{3.4}$$

Pı P2 
(ii) ppx, ppz of the remaining and then equal probabilities. The modification of this procedure consists in following up the procedure explained for case (i) above by selecting the rest of (n − 2) units with equal probability without replacement from the remaining (N − 2) units in the population. This selection procedure makes the following estimator unbiased for the ratio R


(3.5)

since in this case the probability of getting a particular sample s is given by


(3.6)


Before giving the selection procedure which provides an unbiased ratio estimator corresponding to the usual biased ratio estimator in the case of pps systematic sampling, the method of sampling with probability proportional to size (say the value of a character z) systematically will be briefly explained, since this has not become, as yet, well known. Let $Z_1, Z_2, \ldots, Z_N$ be the sizes of the units in the population. Suppose the i-th unit is made up of $nZ_i$ sub-units, each having the value $y_i/(nZ_i)$ for the variate y. The procedure of pps systematic sampling consists in selecting a random number from 1 to $Z\ (= \sum_{i=1}^{N} Z_i)$. The sub-unit having that number is selected in the sample together with every subsequent Z-th sub-unit. As the total number of sub-units is nZ there will be n sub-units in the sample. Let the sample be $(y_1, y_2, \ldots, y_n)$. An unbiased estimator of the population total Y is given by

$$\hat{Y} = \frac{1}{n}\sum_{i=1}^{n} \frac{y_i}{p_i} \tag{3.7}$$

where $p_i = Z_i/Z$. It may be noted that this estimator resembles that used in the case of pps with replacement sampling. But the variances of these two estimators are different.


Though the expected number of repetitions in a sample for the i-th unit is $np_i$ in both the pps with replacement scheme and the pps systematic sampling, the numbers of possible repetitions in a sample are different. For instance, in pps with replacement scheme, the i-th unit may occur 0, 1, 2, ..., n times in a sample of size n whereas in pps systematic sampling it occurs either $[np_i]$ or $[np_i]+1$ times. As the randomisation of the number of repetitions is over a smaller range in the case of pps systematic sampling than that in the case of pps with replacement, it is expected that the former method is more efficient than the latter. Further the efficiency of the estimator based on this method could be increased appreciably by effecting a suitable arrangement of the units in the population before selection. Being a systematically drawn sample it would not be possible to estimate the sampling variance unbiasedly from a single sample.
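
The pps systematic procedure can be sketched with cumulated sizes as below. The sketch assumes integer sizes, numbers the sub-units 1 to nZ in the order of the units, and takes the random start as an explicit argument so that all Z possible samples can be enumerated; the names `pps_systematic`, `estimate_total` and `draw` are hypothetical.

```python
import random
from bisect import bisect_left
from fractions import Fraction
from itertools import accumulate


def pps_systematic(sizes, n, start):
    """Unit indices hit by sub-units start, start+Z, ..., start+(n-1)Z.

    Unit i owns n*sizes[i] consecutive sub-units; `start` lies in 1..Z.
    """
    Z = sum(sizes)
    # Last sub-unit number belonging to each unit.
    bounds = [n * c for c in accumulate(sizes)]
    return [bisect_left(bounds, start + k * Z) for k in range(n)]


def estimate_total(y, sizes, sample):
    """Estimator (1/n) * sum of y_i / p_i with p_i = Z_i / Z."""
    Z = sum(sizes)
    return sum(Fraction(y[i] * Z, sizes[i]) for i in sample) / len(sample)


def draw(sizes, n, rng=random):
    """Draw one pps systematic sample with a random start from 1 to Z."""
    return pps_systematic(sizes, n, rng.randint(1, sum(sizes)))
```

Averaging `estimate_total` over the Z equally likely starts reproduces the population total exactly, and each unit appears either $[np_i]$ or $[np_i]+1$ times in every sample, in line with the remark on repetitions above.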


The modification of the above method to provide an unbiased estimator of R consists in selecting one sub-unit with probability proportional to $x_i/p_i$. Then with that as the random start a systematic sample of n sub-units is selected proceeding cyclically with Z as the sampling interval. The probability of getting a particular sample s is given by

$$P(s) = \frac{1}{Z}\cdot\frac{1}{nX}\sum_{i=1}^{n} \frac{x_i}{p_i} \tag{3.8}$$

This procedure provides the following unbiased estimator of the ratio R

$$\hat{R} = \frac{\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{y_i}{p_i}}{\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{x_i}{p_i}} \tag{3.9}$$


4. STRATIFIED SAMPLING 


Let k be the number of strata and $N_i$ and $n_i$ be the number of units in the population and the sample respectively for the i-th stratum. For stratified simple random sampling without replacement the modification in the selection procedure for getting an unbiased ratio estimator consists of selecting one unit (say the j-th unit in the i-th stratum) from the whole population with ppx, $(n_i - 1)$ units from the remaining $(N_i - 1)$ units in the i-th stratum and $n_{i'}$ units from $N_{i'}$ units of the i'-th stratum (i' ≠ i) with equal probability without replacement. The probability of getting a particular sample s is given by

$$P(s) = \frac{\sum_{i=1}^{k} N_i \bar{x}_i}{X\prod_{i=1}^{k}\binom{N_i}{n_i}} \tag{4.1}$$

where $\bar{x}_i$ is the sample mean in the i-th stratum for the variate x. With this procedure an unbiased estimator of the ratio R is given by

$$\hat{R} = \frac{\sum_{i=1}^{k} N_i \bar{y}_i}{\sum_{i=1}^{k} N_i \bar{x}_i} \tag{4.2}$$


An unbiased estimator of the variance of the estimator $\hat{R}$ in (4.2) is given by

$$v(\hat{R}) = \hat{R}^2 - \frac{1}{X\sum_{i=1}^{k} N_i \bar{x}_i}\left[\sum_{i=1}^{k} \frac{N_i}{n_i}\left\{\sum_{j=1}^{n_i} y_{ij}^2 + \frac{2(N_i - 1)}{n_i - 1}\sum_{j<j'} y_{ij}\, y_{ij'}\right\} + 2\sum_{i<i'} N_i N_{i'}\, \bar{y}_i \bar{y}_{i'}\right]$$

where $y_{ij}$ is the value of y for the j-th sampled unit of the i-th stratum.

It may be noted that $\hat{R}$ resembles the biased combined ratio estimator of Y. The modifications of sampling schemes with other types of designs in the strata can be given on similar lines with a view to getting unbiased ratio estimators.


5. TWO-STAGE SAMPLING 


In the case of a two-stage sampling design with equal probability selection 
without replacement at each stage, the selection procedure for providing an unbiased 
ratio estimator consists in selecting one second stage unit from the whole population 
of second stage units with ppx. If this second stage unit is from the i-th first stage 
unit, the rest of (n;— 1) second stage units to be sampled from the i-th first stage unit 
are selected from the remaining (N;—1) units there with equal probability without 
replacement. The rest of the sample of (n—1) first stage units is drawn from the 
remaining (N—1) units with equal probability without replacement. From these 
selected first stage units the required number of second stage units are selected with equal probability without replacement. In this case the probability of getting a particular sample s is given by

$$P(s) = \frac{\sum_{i=1}^{n} N_i \bar{x}_i}{X\binom{N-1}{n-1}\prod_{i=1}^{n}\binom{N_i}{n_i}} \tag{5.1}$$

where $\bar{x}_i$ is the sample mean in the i-th selected first stage unit. This selection procedure provides an unbiased estimator of the ratio R = Y/X

$$\hat{R} = \frac{\sum_{i=1}^{n} N_i \bar{y}_i}{\sum_{i=1}^{n} N_i \bar{x}_i} \tag{5.2}$$

The above procedure could easily be extended to the case of sampling designs with more than two stages, as is shown in the next section.


6. MULTI-STAGE DESIGN 


The principle involved in giving a selection procedure which provides an unbiased ratio estimator in the case of multi-stage sampling is the same as in the cases illustrated earlier. That is, one final stage unit is to be selected with ppx and the rest of the units according to some probability scheme. For the sake of simplicity only the probability scheme where the units are selected with equal probability without replacement at each stage is considered here.

One final stage unit is to be selected first from the whole population with ppx and then the rest of the sample units are to be selected from the remaining units in the universe with equal probability without replacement at each stage. This can be achieved as follows. Suppose there are m stages. One first stage unit (the $i_1$-th) is to be selected with ppx and the other (n−1) units with equal probability without replacement from the remaining (N−1) units. From the first stage unit selected with ppx, one second stage unit is to be selected with ppx and the rest of $(n_{i_1} - 1)$ units are to be selected from the remaining $(N_{i_1} - 1)$ units with equal probability without replacement. Similarly from the j-th stage unit selected with ppx one (j+1)-th stage unit is to be selected with ppx and the other $(n_{i_1 i_2 \ldots i_j} - 1)$ units are to be selected with equal probability without replacement from the remaining $(N_{i_1 i_2 \ldots i_j} - 1)$ units (j = 0, 1, 2, ..., (m−1)). From the first and the subsequent stage units selected with equal probability, the required number of higher stage units are to be selected with equal probability without replacement. The probability of getting a particular sample s is given by


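$$P(s) = \frac{\dfrac{N}{n}\sum_{i_1=1}^{n} N_{i_1}\, \bar{x}_{i_1}(m)}{X\prod_{j=0}^{m-1}\ \prod_{i_1=1}^{n}\prod_{i_2=1}^{n_{i_1}}\cdots\prod_{i_j=1}^{n_{i_1 \ldots i_{j-1}}}\binom{N_{i_1 i_2 \ldots i_j}}{n_{i_1 i_2 \ldots i_j}}} \tag{6.1}$$

where the factor for j = 0 is $\binom{N}{n}$ and

$$\bar{x}_{i_1}(m) = \frac{1}{n_{i_1}}\sum_{i_2=1}^{n_{i_1}} N_{i_1 i_2}\,\frac{1}{n_{i_1 i_2}}\sum_{i_3=1}^{n_{i_1 i_2}} N_{i_1 i_2 i_3}\cdots\frac{1}{n_{i_1 \ldots i_{m-1}}}\sum_{i_m=1}^{n_{i_1 \ldots i_{m-1}}} x_{i_1 i_2 \ldots i_m},$$

$x_{i_1 i_2 \ldots i_m}$ being the value of a typical final stage unit. In this case an unbiased estimator of R is

$$\hat{R} = \frac{\sum_{i_1=1}^{n} N_{i_1}\, \bar{y}_{i_1}(m)}{\sum_{i_1=1}^{n} N_{i_1}\, \bar{x}_{i_1}(m)} \tag{6.2}$$

where $\bar{y}_{i_1}(m)$ has an interpretation similar to that of $\bar{x}_{i_1}(m)$.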


7. A GENERALISED ESTIMATION PROCEDURE 


In this section a generalised procedure for estimating unbiasedly certain types of parameters applicable to a large number of sampling designs is given. It has become necessary to use some notations which are explained below with suitable examples.

Let $\mathcal{U}$ denote a population of finite number of units, say, a universe of N units $U_1, U_2, \ldots, U_N$, and A the class of sets α whose elements belong to $\mathcal{U}$. In such a set the same unit may or may not occur more than once. The class of all point sets and the class of all pairs of units belonging to $\mathcal{U}$ are examples of the class A.

Let the population parameter F be expressible as

$$F = \sum_{\alpha \in A} f(\alpha) \tag{7.1}$$

where f(α) is a single-valued set function defined over the class A and $\sum_{\alpha \in A}$ stands for summation over all sets α belonging to the class A. For example, the population total Y can be expressed as F in (7.1) with α as a point set $(u_i)$ and f(α) as $y_i$, the value of the i-th unit for the character y, and Y² can be expressed as F in (7.1) with α as a set with two units $\{u_i, u_j\}$* and with f(α) defined as

$$f(\alpha) = \begin{cases} 2\, y_i y_j, & \alpha = \{u_i, u_j\},\ i \neq j, \\ y_i^2, & \alpha = \{u_i, u_j\},\ i = j. \end{cases}$$

Let a sample ω be drawn from the population $\mathcal{U}$ with probability P(ω). This ω again is a set whose elements belong to $\mathcal{U}$. It may be noted that the same unit may or may not occur more than once in ω. The class of all such sets will be denoted by Ω which will be the total sample space.

It will be possible to estimate the population parameter F from the sample ω only if each ω contains at least one set α and each set α is contained in at least one ω.


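An estimator of the parameter F is given by

$$\hat{F} = \frac{\sum_{\alpha \subset \omega} f(\alpha)\, \phi(\omega, \alpha)}{P(\omega)} \tag{7.2}$$

where $\sum_{\alpha \subset \omega}$ stands for the summation over all sets α contained in the sample ω and φ(ω, α) is a function of ω and α. This estimator will be unbiased if

$$\sum_{\omega \supset \alpha} \phi(\omega, \alpha) = 1$$

where $\sum_{\omega \supset \alpha}$ stands for the summation over all samples ω which contain α, since

$$E(\hat{F}) = \sum_{\omega \in \Omega}\ \sum_{\alpha \subset \omega} f(\alpha)\, \phi(\omega, \alpha) = \sum_{\alpha \in A} f(\alpha) \sum_{\omega \supset \alpha} \phi(\omega, \alpha) = F.$$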

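An unbiased estimator of the variance of $\hat{F}$ is given by

$$v(\hat{F}) = \hat{F}^2 - \frac{\sum_{\alpha, \alpha' \subset \omega} f(\alpha) f(\alpha')\, \psi(\omega, \alpha, \alpha')}{P(\omega)}$$

where $\sum_{\alpha, \alpha' \subset \omega}$ stands for the summation over all pairs (α, α') contained in the sample ω and ψ(ω, α, α') is a function of ω and the pair (α, α') such that

$$\sum_{\omega \supset \alpha, \alpha'} \psi(\omega, \alpha, \alpha') = 1$$

where $\sum_{\omega \supset \alpha, \alpha'}$ stands for the summation over all the samples containing the pair of sets (α, α'), since

$$E\left[\frac{\sum_{\alpha, \alpha' \subset \omega} f(\alpha) f(\alpha')\, \psi(\omega, \alpha, \alpha')}{P(\omega)}\right] = \sum_{\alpha, \alpha' \in A} f(\alpha) f(\alpha') \sum_{\omega \supset \alpha, \alpha'} \psi(\omega, \alpha, \alpha') = F^2.$$

* The curled brackets { } are used to denote unordered sets, that is, $\{u_i, u_j\}$ and $\{u_j, u_i\}$ are the same.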


The case where φ(ω, α) is taken as P(ω|α), the conditional probability of getting the sample ω given that the set α has been selected first, is of interest as in that case it is possible to verify that for many of the designs in general use, this estimator

$$\hat{F} = \frac{\sum_{\alpha \subset \omega} f(\alpha)\, P(\omega|\alpha)}{P(\omega)} \tag{7.4}$$

reduces to the usual estimators of the parameter. An unbiased estimator of its variance is given by

$$v(\hat{F}) = \hat{F}^2 - \frac{\sum_{\alpha, \alpha' \subset \omega} f(\alpha) f(\alpha')\, P(\omega|\alpha \cup \alpha')}{P(\omega)} \tag{7.5}$$

where P(ω|α∪α') is the conditional probability of getting the sample ω given that the units in the union of the two sets α and α' have been selected first. The above variance estimator may take negative values.

An estimator of the variance is possible only if every set (α∪α') is contained in at least one ω and every ω contains at least one set (α∪α').


8. UNBIASED RATIO ESTIMATOR 


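Following the above estimation procedure, an estimator for the ratio R of two parameters F and G which can be expressed as

$$F = \sum_{\alpha \in A} f(\alpha), \qquad G = \sum_{\alpha \in A} g(\alpha),$$

where g(α) is another single valued set function defined over the class A, is given by

$$\hat{R} = \frac{\sum_{\alpha \subset \omega} f(\alpha)\, \phi(\omega, \alpha)}{\sum_{\alpha \subset \omega} g(\alpha)\, \phi(\omega, \alpha)} \tag{8.1}$$

This estimator will be unbiased if

$$P(\omega) = \frac{\sum_{\alpha \subset \omega} g(\alpha)\, \phi(\omega, \alpha)}{\sum_{\alpha \in A} g(\alpha)} \tag{8.2}$$

since, provided $\sum_{\omega \supset \alpha} \phi(\omega, \alpha) = 1$,

$$E(\hat{R}) = \sum_{\omega \in \Omega} P(\omega)\, \hat{R} = \frac{1}{G}\sum_{\omega \in \Omega}\ \sum_{\alpha \subset \omega} f(\alpha)\, \phi(\omega, \alpha) = \frac{F}{G}.$$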


If the g(α)'s are either all positive or all negative the above form of P(ω) can be obtained by first selecting a set α with probability proportional to g(α) and then drawing the rest of the units with some probability scheme. In this case the probability of getting ω is

$$P(\omega) = \frac{\sum_{\alpha \subset \omega} g(\alpha)\, P(\omega|\alpha)}{\sum_{\alpha \in A} g(\alpha)}$$

This shows that if in the general case φ(ω, α) is taken as P(ω|α), the estimator given in (8.1) becomes unbiased for the ratio

$$\hat{R} = \frac{\sum_{\alpha \subset \omega} f(\alpha)\, P(\omega|\alpha)}{\sum_{\alpha \subset \omega} g(\alpha)\, P(\omega|\alpha)} \tag{8.3}$$

An unbiased ratio estimator of F is given by

$$\hat{F} = \hat{R}\, G \tag{8.4}$$


If P(ω|α) is independent of the sample ω, the estimator becomes

(8.5)

and if P(ω|α) is independent of the set α,

$$\hat{R} = \frac{\sum_{\alpha \subset \omega} f(\alpha)}{\sum_{\alpha \subset \omega} g(\alpha)} \tag{8.6}$$

An unbiased estimator of the variance of $\hat{R}$ is given by

$$v(\hat{R}) = \hat{R}^2 - \frac{\sum_{\alpha, \alpha' \subset \omega} f(\alpha) f(\alpha')\, P(\omega|\alpha \cup \alpha')}{G^2\, P(\omega)} \tag{8.7}$$

It is possible that F and G can be expressed as sums of set functions defined over more than one class of sets, that is,

$$F = \sum_{\alpha \in A} f(\alpha) = \sum_{\alpha' \in A'} f'(\alpha') = \ldots$$

$$G = \sum_{\alpha \in A} g(\alpha) = \sum_{\alpha' \in A'} g'(\alpha') = \ldots$$

For each such expression we can give a sampling procedure providing an unbiased ratio estimator. From the point of view of operational convenience, it is preferable to take that class of sets which contains the smaller sets. The size of a set is judged by the number of units it contains. Two examples are given to illustrate the point.


(a) The population total X can be expressed in the following two ways

$$X = \sum_{i=1}^{N} x_i = \frac{1}{\binom{N-1}{n-1}}\, S\left(\sum_{i=1}^{n} x_i\right)$$

where S stands for the summation over all sets of n distinct units. In this case the former is to be preferred to the latter because in the former only one unit is to be selected with ppx whereas in the latter case n units are to be drawn with probability proportional to their total size.


(b) The population variance $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{X})^2$, where $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} x_i$, can again be expressed in the following two ways

$$\sigma^2 = \frac{n}{N^2\binom{N-2}{n-2}}\, S\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{N^2}\, S'(x_i - x_j)^2$$

where S stands for the summation over all sets of n units, $\bar{x}$ is the sample mean and S' stands for the summation over all sets of two units. Here the latter is to be preferred to the former because in the latter case only two units as compared to n units in the former case are to be selected with probability proportional to their measure of size.

It may be verified that all the cases discussed in earlier sections are particular cases of the generalised unbiased ratio estimator considered here. The procedure explained above will be illustrated by applying it to the question of getting an unbiased estimator of the regression coefficient.


9. REGRESSION COEFFICIENT 


$$\beta = \frac{\sum_{i=1}^{N} (x_i - \bar{X})(y_i - \bar{Y})}{\sum_{i=1}^{N} (x_i - \bar{X})^2} \tag{9.1}$$

The numerator and the denominator can be expressed as follows

$$\sum_{i=1}^{N} (x_i - \bar{X})(y_i - \bar{Y}) = \frac{1}{N}\sum_{i<j} (x_i - x_j)(y_i - y_j), \qquad \sum_{i=1}^{N} (x_i - \bar{X})^2 = \frac{1}{N}\sum_{i<j} (x_i - x_j)^2.$$

Thus the parameters F and G are sums of set functions defined over the class of sets containing only two elements. In the terminology of section 8

$$f(\alpha) = \frac{1}{N}(x_i - x_j)(y_i - y_j), \qquad g(\alpha) = \frac{1}{N}(x_i - x_j)^2,$$

where α is a set containing two elements.

The selection procedure consists in selecting a pair of units with probability proportional to $(x_i - x_j)^2$ and the rest (n−2) units with equal probability without replacement from the remaining (N−2) units. The conditional probability of getting the sample ω given that the pair (i, j) is selected first is

$$P(\omega|ij) = \frac{1}{\binom{N-2}{n-2}} \tag{9.2}$$

This is independent of the pair of units selected first. Hence an unbiased estimator of β is given by

$$b = \frac{\sum_{i<j} (x_i - x_j)(y_i - y_j)}{\sum_{i<j} (x_i - x_j)^2}, \quad \text{i.e.} \quad b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$

where $\bar{x}$ and $\bar{y}$ are the sample means and the sums extend over the units in the sample. The variance and the variance estimator can be got by referring to section 8.
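
As an illustration, the unbiasedness of the sample regression coefficient under this selection scheme can be verified by complete enumeration on a small population. The sketch assumes that an unordered sample s has probability equal to the sum of $(x_i - x_j)^2$ over the pairs in s, divided by the same sum over the whole population times $\binom{N-2}{n-2}$; the function names and the data in the usage note are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def pair_sums(units, x, y):
    """Sums over unordered pairs of (xi-xj)(yi-yj) and of (xi-xj)^2."""
    pairs = list(combinations(units, 2))
    sxy = sum((x[i] - x[j]) * (y[i] - y[j]) for i, j in pairs)
    sxx = sum((x[i] - x[j]) ** 2 for i, j in pairs)
    return sxy, sxx


def population_slope(x, y):
    sxy, sxx = pair_sums(range(len(x)), x, y)
    return Fraction(sxy, sxx)


def expected_slope(x, y, n):
    """Return (sum of P(s), E[b]) over all samples of size n."""
    N = len(x)
    _, total_sxx = pair_sums(range(N), x, y)
    total_p = Fraction(0)
    mean_b = Fraction(0)
    for s in combinations(range(N), n):
        sxy, sxx = pair_sums(s, x, y)
        if sxx == 0:
            continue  # such a sample has zero probability of selection
        p = Fraction(sxx, total_sxx * comb(N - 2, n - 2))
        total_p += p
        mean_b += p * Fraction(sxy, sxx)
    return total_p, mean_b
```

For integer data such as `x = [1, 2, 4, 7, 11]` and `y = [3, 1, 4, 1, 5]`, `expected_slope(x, y, 3)` returns a total probability of one and an expectation equal to `population_slope(x, y)`.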
For selecting one pair of units with probability proportional to $(x_i - x_j)^2$ the following procedure may be adopted:
(i) two random numbers should be selected from 1 to N (say, i, j);
(ii) a pair of random numbers should be selected from 1 to Max $|x_i - x_j|$ (= range of x);
(iii) if both these numbers are less than or equal to $|x_i - x_j|$ the pair (i, j) is accepted, otherwise it is rejected;
(iv) if a pair is rejected, the operation is to be repeated starting from (i).

It may be noted that unbiased estimators of ρ² (square of the correlation coefficient of x and y) and β² can be got by selecting a set of four elements first with suitable probabilities and the rest with any probability scheme.


10. TWO-PHASE SAMPLING


For estimating the parameter F unbiasedly using a ratio estimator with g(α) as supplementary information, it is necessary to know the value of G. If the value of G is not known in advance and if it is easier and less costly to observe g(α) than f(α), then a two-phase design may be used to get an unbiased ratio estimator of F.

The procedure consists in selecting a large sample S from the whole population with some probability scheme and observing the value of g(α) for all sets α∈A which are contained in S. A sample ω is drawn from S by first selecting a set α∈A with probability proportional to g(α)P(S|α) and then selecting the rest of the sample with some probability scheme. In this case an unbiased ratio estimator of F is given by

$$\hat{F} = \frac{\sum_{\alpha \subset \omega} f(\alpha)\, P(S|\alpha)\, P(\omega|S, \alpha)}{\sum_{\alpha \subset \omega} g(\alpha)\, P(S|\alpha)\, P(\omega|S, \alpha)} \cdot \frac{\sum_{\alpha \subset S} g(\alpha)\, P(S|\alpha)}{P(S)}$$

where P(ω|S, α) denotes the probability of selecting the sample ω from S given that α was selected first in the process. This probability refers to the second-phase sampling.


The authors wish to thank Prof. D. B. Lahiri and Dr. D. Basu for their constant encouragement and advice.


TABLES FOR SOME SMALL SAMPLE TESTS OF SIGNIFICANCE
FOR POISSON DISTRIBUTIONS AND 2x3
CONTINGENCY TABLES

By I. M. CHAKRAVARTI 
and 
C. RADHAKRISHNA RAO 
Indian Statistical Institute, Calcutta 


SUMMARY. Extended tables of critical values for a level of significance α < 0.05 are provided for the variance and likelihood tests for homogeneity, goodness of fit χ² and likelihood tests, and deviation in the zero frequency for samples from a Poisson distribution. A table of critical values for the variance test of homogeneity of samples from truncated Poisson distribution is also given.

For the binomial population, the variance and likelihood tests for homogeneity have been given and tables of critical values worked out, when there are three independent samples of the same size.

Illustrations have been given explaining the use of the various tables.


0. INTRODUCTION 


In an earlier paper (Rao and Chakravarti, 1956), exact tests were provided for (i) homogeneity of samples and goodness of fit for Poisson and truncated Poisson populations, and (ii) deviation in the 'zero frequency' for Poisson and binomial populations.

Tables of critical values were given for a level of significance $\alpha \le .05$, for the variance and likelihood tests for homogeneity, goodness of fit $\chi^2$ and likelihood tests, and deviation in the zero frequency for samples from a Poisson distribution.

These tables have now been extended. A table of critical values of the variance test for homogeneity of samples from truncated Poisson distribution is also given. Variance and likelihood tests for homogeneity are derived for the binomial population. Tables of critical values are given for both these tests when there are three independent samples of the same size.
1. TEST CRITERIA

1.1. Poisson population. Let $x_1, x_2, \ldots, x_f$ be f observations from Poisson populations, and let $f_r$ denote the frequency of the variate value r in the sample. Denoting $\sum r f_r = \sum x_i = T$, the statistics proposed as test criteria were

$\sum x_i^2$ : variance test for homogeneity;
$\sum x_i \log_e x_i$ : likelihood test for homogeneity;
$\sum f_r \log_e (r!\, f_r)$ : likelihood test for goodness of fit;
$f_0$ : frequency of the zero class for testing excess in that frequency.

The conditional distributions of these statistics given T and f were derived.

Tables of critical values of these statistics have now been extended to cover the cases of T = 11, 12, for f = 3(1)10(10)100.


1.2. Truncated Poisson. The analogue of the variance test for homogeneity in the truncated case was found to be

$$\frac{\sum (x_i - \bar{x})^2}{\bar{x}\,(1 + \hat{m} - \bar{x})} \qquad \ldots\ (1.2.1)$$

where $\bar{x} = \sum x_i / f' = \hat{m}/(1 - e^{-\hat{m}})$ and $x_1, x_2, \ldots, x_{f'}$ are f' observations from a truncated population. The conditional distribution of the statistic (1.2.1) for fixed $\sum x_i$ and f' is the same as that of $\sum x_i^2$; so one can use $\sum x_i^2$ instead. Critical values of this statistic are given for f' = 3(1)9, T = 8(1)12.
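As a rough illustration of the truncated Poisson criterion, the sketch below solves $\bar{x} = m/(1 - e^{-m})$ for $\hat{m}$ by bisection and evaluates the standardized statistic alongside the simpler $\sum x_i^2$. It assumes zero-truncation and the variance $\bar{x}(1 + \hat{m} - \bar{x})$ of a zero-truncated Poisson variable; the function names and the sample counts are illustrative choices, not taken from the paper's tables.

```python
import math


def solve_truncated_poisson_m(mean, tol=1e-12):
    """Solve mean = m / (1 - exp(-m)) for m > 0 by bisection (assumed zero-truncation)."""
    if mean <= 1.0:
        raise ValueError("the mean of a zero-truncated Poisson sample must exceed 1")
    lo, hi = 0.0, mean  # m / (1 - exp(-m)) > m for m > 0, so the root lies below `mean`
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid / -math.expm1(-mid) < mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def truncated_variance_statistic(xs):
    """Return (standardized statistic, m_hat) for a sample of positive counts."""
    mean = sum(xs) / len(xs)
    m_hat = solve_truncated_poisson_m(mean)
    sum_sq_dev = sum((x - mean) ** 2 for x in xs)
    return sum_sq_dev / (mean * (1.0 + m_hat - mean)), m_hat


counts = [6, 3, 1, 1, 1]                   # hypothetical sample of positive counts
statistic, m_hat = truncated_variance_statistic(counts)
sum_sq = sum(x * x for x in counts)        # equivalent criterion for fixed total and f'
print(sum_sq, m_hat, statistic)
```

For a fixed total and a fixed number of observations the standardized statistic is an increasing function of $\sum x_i^2$, which is why only the latter needs to be tabulated.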

1.3. Binomial population. If f sets of s trials are made with probability p of success and $x_1, x_2, \ldots, x_f$ denote the number of successes in the different sets, then the conditional probability of $x_1, x_2, \ldots, x_f$ given $\sum x_i = T$, f and s is

$$\frac{T!\,(fs - T)!}{(fs)!} \prod_{i=1}^{f} \binom{s}{x_i}.$$

The statistics which are analogues of the variance test and the likelihood test for homogeneity are

$$s \sum (x_i - \bar{x})^2 / \bar{x}(s - \bar{x}) \quad \text{and} \quad \sum x_i \log_e x_i + \sum (s - x_i) \log_e (s - x_i)$$

respectively, where $\bar{x} = T/f$. For the former, one may use $\sum x_i^2$, since the conditional distributions are being considered.

Tables of critical values are provided for these statistics for f = 3, s = 3(1)10, T = 3(1)26.




To facilitate the computation of likelihood statistics an auxiliary table of $n \log_e n$ for n = 1(1)100 is also given.


2. USE OF TABLES 


2.1. Critical values. Below each critical value is recorded the exact level of significance. When this is too low, the next lower value of the criterion is also recorded with the corresponding probability level if it is not much above 5 percent. If the observed value of the statistic is equal to or greater than the tabulated value, then the null hypothesis is rejected at the level of significance indicated.


2.2. Examples: (a) Poisson distribution.

no. of accidents     0     1     2     3     total
frequency           32     5     2     1      40

T = 0×32 + 1×5 + 2×2 + 3×1 = 12.

(i) Homogeneity (variance test, Table 1).

$\sum x_i^2 = \sum r^2 f_r = 1^2 \times 5 + 2^2 \times 2 + 3^2 \times 1 = 22$.

This value being equal to the critical value for T = 12, f = 40, the null hypothesis of homogeneity is rejected.

(ii) Homogeneity (likelihood test, Table 2).

$\sum x_i \log_e x_i = \sum (r \log_e r) f_r = 6.068$.

Critical value of this criterion for T = 12, f = 40 is 5.5.

The computed value being greater than this, the samples cannot be regarded as homogeneous.


(iii) Goodness of fit (likelihood test, Table 3).

$\sum f_r \log_e (f_r\, r!) = \sum f_r \log_e f_r + \sum f_r \log_e r! = 123.515$.

The observed value is close to the critical value of 124.76 thus indicating departure from the null hypothesis though not significantly so. The goodness of fit test is, probably, not very sensitive to particular departures from the null hypothesis.

(iv) Test for the deviation in 'zero frequency' (Table 4).

Frequency of 'zero' = 32.

Since this value is equal to the critical value, a significant excess in the frequency of the zero class is indicated.
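The four Poisson criteria can be recomputed directly from a frequency table. The following is a minimal sketch for the accident data of example (a); the function name and the dictionary keys are illustrative assumptions, and natural logarithms are used throughout.

```python
import math


def poisson_test_statistics(freq):
    """freq[r] is the number of observations equal to r (r = 0, 1, 2, ...)."""
    f = sum(freq)
    total = sum(r * fr for r, fr in enumerate(freq))
    sum_sq = sum(r * r * fr for r, fr in enumerate(freq))
    # Likelihood criterion for homogeneity: sum of x log x, with 0 log 0 taken as 0.
    lik_homogeneity = sum(fr * r * math.log(r) for r, fr in enumerate(freq) if r > 0)
    # Likelihood criterion for goodness of fit: sum of f_r log(f_r r!); lgamma(r + 1) = log r!.
    lik_fit = sum(fr * (math.log(fr) + math.lgamma(r + 1))
                  for r, fr in enumerate(freq) if fr > 0)
    return {"f": f, "T": total, "sum_sq": sum_sq,
            "lik_homogeneity": lik_homogeneity, "lik_fit": lik_fit,
            "zero_frequency": freq[0]}


stats = poisson_test_statistics([32, 5, 2, 1])
for name, value in stats.items():
    print(name, value)
```

Each value is then compared with the tabulated critical value for the same T and f.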


(b) Truncated Poisson. Homogeneity (variance test, Table 7).

In a study of a rare abnormality in children, 5 families of size 8 reported the following number of abnormal children: 6, 3, 1, 1, 1. Can these samples be regarded as homogeneous?

Here $\sum x_i^2 = 6^2 + 3^2 + 1^2 + 1^2 + 1^2 = 48$. The corresponding critical value for f' = 5 and T = 12 is 46 with a level of significance .05. Since the observed value is larger, the hypothesis of homogeneity is rejected.


(c) Binomial. Three different preparations A, B and C of a drug were to be compared for a certain response in mice. Accordingly, 30 animals matched for age and litter were allotted to the three groups at random and the following data were obtained.

preparations     responding     nonresponding     total
     A                1               9             10
     B                3               7             10
     C                8               2             10

Are the preparations equivalent in producing response?

(i) Homogeneity (variance test, Table 5).

Here $T = \sum x_i = 1 + 3 + 8 = 12$, $\sum x_i^2 = 1^2 + 3^2 + 8^2 = 74$.

The observed value of $\sum x_i^2$ is greater than the corresponding critical value 66 for T = 12, s = 10, at a level of significance 0.03. Hence the test indicates a significant difference in the effects produced by the preparations.

(ii) Homogeneity (likelihood test, Table 6).

$\sum x_i \log_e x_i + \sum (s - x_i) \log_e (s - x_i)$

= $1 \log_e 1 + 3 \log_e 3 + 8 \log_e 8 + 9 \log_e 9 + 7 \log_e 7 + 2 \log_e 2$
= 54.714.

The corresponding critical value being only 52.987 for T = 12, s = 10 at a level of significance .03, the test establishes significant differences between the three preparations.
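The exact level attached to a critical value in the binomial case can be checked by enumerating the conditional distribution given T. The sketch below does this for f = 3 samples of s = 10 trials, weighting each configuration by the product of binomial coefficients; the function names are illustrative assumptions and the brute-force enumeration is only meant for small f and s.

```python
from itertools import product
from math import comb, log


def conditional_tail(total, sets, trials, statistic, threshold):
    """P(statistic >= threshold | sum of successes = total) for `sets` samples of `trials` trials."""
    denominator = comb(sets * trials, total)
    tail = 0
    for xs in product(range(trials + 1), repeat=sets):
        if sum(xs) != total:
            continue
        weight = 1
        for x in xs:
            weight *= comb(trials, x)   # numerator of the conditional probability
        if statistic(xs, trials) >= threshold:
            tail += weight
    return tail / denominator


def sum_of_squares(xs, trials):
    return sum(x * x for x in xs)


def likelihood_statistic(xs, trials):
    def xlogx(n):
        return n * log(n) if n > 0 else 0.0
    return sum(xlogx(x) + xlogx(trials - x) for x in xs)


observed = (1, 3, 8)                      # responders under preparations A, B, C
variance_value = sum_of_squares(observed, 10)
likelihood_value = likelihood_statistic(observed, 10)
level_at_66 = conditional_tail(12, 3, 10, sum_of_squares, 66)
print(variance_value, likelihood_value, level_at_66)
```

The computed level at the critical value 66 should come out close to the 0.03 quoted for T = 12, s = 10.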
TABLE 7. VARIANCE TEST FOR HOMOGENEITY
(TRUNCATED POISSON DISTRIBUTION)

Statistic: $\sum x_i^2$

f' = number of observations. T = total of observations. Values of $\sum x_i^2$ greater than or equal to the tabulated values are significant at a level indicated in
EXPECTED VALUES OF MEAN SQUARES IN THE ANALYSIS OF
INCOMPLETE BLOCK EXPERIMENTS AND SOME
COMMENTS BASED ON THEM


By C. RADHAKRISHNA RAO 
Indian Statistical Institute, Calcutta 


SUMMARY. Reference has been made to the logical status of Fisher's null hypothesis that all varieties under test have the same yield on any experimental plot. The types of departures from the null hypothesis, which the ratio of mean squares of varieties to error can detect with a reasonable chance, have been examined in the case of general incomplete block designs. It appears that the test ignores differences in varieties which are, in some sense, attributable to interaction between blocks and treatments. A study of the consequences of non-random allocation of subsets of varieties to blocks leads to a special property of the BIBD and some PBIBD designs. The effect of random indexing of varieties, i.e., of associating the given varieties with the symbols in which a design is represented, is also considered.


1. INTRODUCTION 


In earlier papers (Rao, 1947, 1956) on general methods of analysis for incomplete block designs, the author has shown how combined intra and inter-block estimates and the expressions for their variances and covariances can be obtained from the theory of least squares under the hypothesis that treatment and plot effects are additive. The present paper is intended to clarify some of the points not fully elaborated in the earlier papers. Further, accepting Fisher's null hypothesis that an observed yield of a variety on a particular plot is purely a plot effect independent of the variety,¹ the types of departures from the null hypothesis which the analysis of variance test can detect have been examined. The latter is done by comparing the expected values of the mean squares for varieties and error in the analysis of variance under a general hypothesis that on each plot the varieties have possibly different yields, and plot × treatment and block × treatment interactions exist. The first attempt in this direction was due to Neyman (1935), who derived the expectations for Randomized block and Latin square designs. Recently Wilk (1955), Wilk and Kempthorne (1957), and others have considered the two cases treated by Neyman under a more general set up.

¹ In this paper no distinction is made between variety and the more general term treatment. They have been used synonymously.

The null hypothesis is sometimes stated as the equality of varieties with respect to the total yields over all plots of the experimental area although plot × treatment and block × treatment interactions may exist (Neyman, 1935). Some concern is expressed when it is found that under these conditions the expected mean square for varieties is smaller than that for error, implying that the analysis of variance ratio test is not unbiased and probably insensitive.

It may be noted that when the total yields of some varieties over a given area are all equal, there must exist portions of the area over which they must differ if interactions exist. The validity of the null hypothesis that the total yields are the same over an experimental area then depends largely on what particular area
has been chosen or was available for the experiment. Such a null hypothesis has, 
therefore, not the same logical status as Fisher's null hypothesis. We may now raise 
the question as to what type of departures from Fisher’s null hypothesis are detect- 
able by the variance ratio test of mean square for varieties to that for error. It turns 
out, in the cases examined before such as Randomized block and Latin square 
designs as well as for other designs considered in the present paper, the variance ratio 
test has only a small chance of detecting overall differences equal to or smaller in 
magnitude than the block x treatment interactions. Differences comparatively larger 
than the interactions have, however, a reasonable chance of detection. It is, perhaps, 
a desirable property of the test that it should ignore overall differences of the order 
attributable, in some sense, to the presence of interactions. The object of the experi- 
ment may not be to examine whether differences exist over the experimental area used, 
but to look for evidence whether the results of the experiment would justify an investi- 
gation on a large scale, over a wider area. The variance ratio test seems best suited 
for this purpose. Indeed, if the experimental area itself is chosen at random from a 
wider area the ratio of variances provides an unbiased test for examining the differences 
in varieties over the wider area. 


Section 2 of this paper is devoted to a brief restatement of some of the results of the earlier paper (Rao, 1947) to clarify some of the statements made earlier and to explain the new notations used in the present communication.


2. RANDOMIZATION ANALYSIS OF INCOMPLETE BLOCK DESIGNS 
UNDER AN ADDITIVE MODEL 


Let us consider an incomplete block design involving v varieties arranged in b subsets of k varieties each, such that every variety is used r times and any pair of varieties g and h occurs in $\lambda_{gh}$ subsets. The actual layout of the experiment in b blocks of k plots is determined by the following randomization procedures.

$R_1$: The subsets of varieties are assigned to the blocks at random.

$R_2$: Within each block, the varieties of a subset are assigned to the plots at random.

The null hypothesis specifies that all varieties give the same yield on each plot of the experimental area. We may, however, write the yield of the g-th variety on the j-th plot of the i-th block as

$$x_{ij}^{g} = \tau_g + x_{ij} \qquad \ldots\ (2.1)$$

where the parameter $\tau_g$ is specific for the g-th variety and $x_{ij}$ is independent of the treatment and may be considered as a plot effect. With the specification (2.1) the null hypothesis under test is

$$H_0 : \tau_1 = \tau_2 = \cdots = \tau_v.$$
Let us define, with the usual notations for averages,

$$\sigma_0^2 = \frac{\sum\sum (x_{ij} - x_{i\cdot})^2}{b(k-1)}, \qquad \sigma_b^2 = \frac{\sum (x_{i\cdot} - x_{\cdot\cdot})^2}{b-1},$$

so that $\sigma_0^2$ is the inherent average variation between plots within blocks and $\sigma_b^2$ is the variation between blocks.

Consider a particular subset of varieties, say $\tau_1, \ldots, \tau_k$ without loss of generality, and represent the corresponding observed yields by $x^1, \ldots, x^k$. Then it is easy to see that under the randomization procedures $R_1$ and $R_2$,

$$E(x^i) = \tau_i + x_{\cdot\cdot},$$

$$V(x^i) = \frac{k-1}{k}\,\sigma_0^2 + \frac{b-1}{b}\,\sigma_b^2,$$

and there exists an orthogonal transformation

$$B/\sqrt{k} = (x^1 + \cdots + x^k)/\sqrt{k} \qquad \ldots\ (2.2)$$

$$y_i = b_{i1} x^1 + \cdots + b_{ik} x^k, \quad i = 1, \ldots, k-1 \qquad \ldots\ (2.3)$$

such that the new variables are all uncorrelated and

$$V(B/\sqrt{k}) = k(b-1)\sigma_b^2 / b,$$

$$V(y_i) = \sigma_0^2, \quad i = 1, \ldots, k-1,$$

$$E(y_i) = b_{i1}\tau_1 + \cdots + b_{ik}\tau_k.$$

An experiment with b blocks provides b(k−1) observations of the type (2.3), which are all uncorrelated, have the same variance $\sigma_0^2$ and have as their expectations linear functions of the unknown parameters $\tau$. Hence the theory of least squares can be used for obtaining the best linear estimates of treatment differences $\tau_i - \tau_j$, expressions for variances of estimates and the analysis of variance for testing any set of linear hypotheses. This supplies the theory of intra-block analysis.

If, in addition, we make an orthogonal transformation of the block totals (divided by $\sqrt{k}$)

$$z_b = (B_1 + \cdots + B_b)/\sqrt{bk},$$

$$z_i = (a_{i1}B_1 + \cdots + a_{ib}B_b)/\sqrt{k}, \quad i = 1, \ldots, b-1 \qquad \ldots\ (2.4)$$

the $z_i$ are uncorrelated, $V(z_i) = k\sigma_b^2$, i = 1, ..., b−1, and $E(z_i)$ is a linear function of varietal effects. The (b−1) observations (2.4), each with variance $k\sigma_b^2$, together with the b(k−1) observations (2.3), each with variance $\sigma_0^2$, yield combined intra and inter-block estimates by the use of weighted least square theory. The expressions for variances etc., involving the reciprocal of the variances

$$w = 1/\sigma_0^2 \quad \text{and} \quad w' = 1/k\sigma_b^2$$

are given in the author's earlier papers (Rao, 1947, 1956) where a different interpretation was given to $\sigma_b^2$.
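The variance of a single observed yield under the two randomization steps can be checked by direct enumeration, since a variety of a given subset is equally likely to land on any of the bk plots. The sketch below assumes the definitions of the within-block and between-block variations with divisors b(k−1) and b−1; the plot effects are hypothetical numbers and the function names are illustrative.

```python
from fractions import Fraction


def variance_components(x):
    """Return (within-block variation, between-block variation) for plot effects x[i][j]."""
    b, k = len(x), len(x[0])
    block_means = [sum(row) / k for row in x]
    grand_mean = sum(block_means) / b
    within = sum((value - mean) ** 2
                 for row, mean in zip(x, block_means) for value in row)
    between = sum((mean - grand_mean) ** 2 for mean in block_means)
    return within / (b * (k - 1)), between / (b - 1)


def variance_of_random_plot(x):
    """Variance of the effect of a plot reached by a random block, then a random plot in it."""
    values = [value for row in x for value in row]
    mean = sum(values) / len(values)
    return sum((value - mean) ** 2 for value in values) / len(values)


# Hypothetical plot effects for b = 4 blocks of k = 3 plots, kept exact with Fraction.
plots = [[Fraction(v) for v in row]
         for row in ([12, 15, 11], [20, 18, 25], [9, 14, 10], [16, 16, 19])]
b, k = len(plots), len(plots[0])
sigma0_sq, sigmab_sq = variance_components(plots)
direct = variance_of_random_plot(plots)
formula = Fraction(k - 1, k) * sigma0_sq + Fraction(b - 1, b) * sigmab_sq
print(direct, formula)
```

Both quantities agree exactly, which is the decomposition of the total variation over all plots into its within-block and between-block parts.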


One useful result of the earlier papers was the derivation of the formulae for varietal differences, and variances and covariances in such a way that the same expressions can be used both for intra-block analysis and combined intra and inter-block analysis by substituting appropriate values for the parameters.

To obtain estimates of $\sigma_0^2$ and $\sigma_b^2$, we use the expectations of mean squares in the analysis of variance of total sum of squares into blocks, varieties and error. Table 1 contains the relevant expectations.

TABLE 1. EXPECTATIONS OF MEAN SQUARES ASSUMING THE SPECIFICATION (2.1) 


due to d.f. 8.8, expected mean square 
blocks (ignoring varieties) bak Hà hoy, i Еа EU = ri ) (тт) 
varieties (eliminating blocks) visse 7 9$ ==) ё, Mj(ri— rj)? 
error g Bn А 
varieties (ignoring blocks) 9—1 5% E. ор % + = Eine 
Weed wu die E % o 


Note: g = bk − b − v + 1 + c, c = degrees of freedom confounded, which is (v−1) minus the number of independent varietal contrasts estimable from intra-block analysis. The value of c is zero when all varietal differences are estimable as in any connected design for varietal trials. The relationship

$$S_b + S_{v \cdot b} = S_v + S_{b \cdot v}$$

could be utilized in computing the sum of squares $S_{b \cdot v}$ (or $S_{v \cdot b}$) after obtaining directly $S_{v \cdot b}$ (or $S_{b \cdot v}$).

The estimates of $\sigma_0^2$ and $\sigma_b^2$ are obtained using the mean squares whose expectations are free from varietal differences.
Ав __ у 
ёр = 6; — 9 


tion are assigned to contiguous blocks so that the variation between replications could be removed from the block differences. Defining $\sigma_r^2$ as the variation between replications, the expectations given in Table 2 are obtained.
TABLE 2. EXPECTATIONS OF MEAN SQUARES IN THE FURTHER ANALYSIS OF BLOCK
VARIATION, ASSUMING THE SPECIFICATION (2.1)


due to d.f. 3.3. expected mean square 

replications (ignoring blocks) т—1 Sr ъс 
blocks (within replications b—r Sow ое „е, ВИ в 
p ) ES Rcs ШҮ бт 98 


blocks (climinating varieties) ь—1 Sow =r 


In such a case, the estimate of $\sigma_b^2$ is
82 = [Sy — (v —kc— 18, /gk] + (0— 007—1). 


For further details the reader is referred to Rao (1947). 


3. EXPECTATIONS OF MEAN SQUARES WHEN $R_1$ IS NOT FOLLOWED

We shall consider the situation when the subsets of varieties are not randomly assigned to the blocks but only the varieties within a block are randomized, i.e., only the procedure $R_2$ of Section 2 is followed. In such a case only intra-block estimation is possible. Considering the set up (2.1) let

$$\sigma_{0i}^2 = \sum_j (x_{ij} - x_{i\cdot})^2 \div (k-1), \quad i = 1, \ldots, b \qquad \ldots\ (3.1)$$

be the variation between plots within the i-th block. The transformed variables $y_1, \ldots, y_{k-1}$ considered in (2.3) arising out of the subset assigned to the i-th block have the variance $\sigma_{0i}^2$. Since $\sigma_{0i}^2$ is unknown and cannot be estimated unless some varieties are repeated more than once in each block, it is not possible to adopt the procedure of weighted least squares (of weighting the variables of each block by the reciprocal of the corresponding intra-block variance). Let us, therefore, examine the consequence of ignoring the differences in $\sigma_{0i}^2$, in estimation and tests of significance of varietal differences.

Let $v_{ij}\sigma_0^2$ denote the variance of $(t_i - t_j)$, the intra-block estimate of the difference between the i-th and j-th varieties, assuming a common variation $\sigma_0^2$ for all the blocks. The expressions $v_{ij}$ for various types of standard designs discussed in literature are known. Let $\delta_s$ be the sum of $v_{ij}$ for all possible pairs i and j of varieties included in the s-th block. Thus, if the s-th block has three varieties designated by 1, 4, 5 then $\delta_s = v_{14} + v_{15} + v_{45}$. The expectations of mean squares for varieties and error in the intra-block analysis of variance are given in Table 3.
TABLE 3. EXPECTATIONS OF MEAN SQUARES ASSUMING THE SPECIFICATION (2.1)
WHEN $R_1$ IS NOT SATISFIED

due to                            d.f.    s.s.              expected mean square
blocks (ignoring varieties)       b−1     $S_b$             —
varieties (eliminating blocks)    v−1     $S_{v \cdot b}$   $\frac{1}{k(v-1)}(\delta_1\sigma_{01}^2 + \cdots + \delta_b\sigma_{0b}^2) + \frac{1}{k(v-1)}\sum\sum_{j<i}\lambda_{ij}(\tau_i - \tau_j)^2$
error                             g       $S_e$             $\frac{1}{g}\left\{ b(k-1)\sigma_0^2 - \frac{1}{k}(\delta_1\sigma_{01}^2 + \cdots + \delta_b\sigma_{0b}^2) \right\}$

$\sigma_0^2 = (\sigma_{01}^2 + \cdots + \sigma_{0b}^2) \div b$


From Table 3, we find that when $\tau_1 = \cdots = \tau_v$, the expected mean squares for varieties and error are not, in general, the same. By equating the coefficients of $\sigma_{0i}^2$ in the two expressions, we obtain

$$\frac{\delta_i}{k(v-1)} = \frac{k-1}{g} - \frac{\delta_i}{kg} \quad \text{or} \quad \delta_i = \frac{k(v-1)}{b}$$

so that the necessary and sufficient condition for the equality of expectations for error and varieties is that the $\delta_i$ are all equal, i.e., the sum of variances of all comparisons within any subset of varieties assigned to a block should be the same, the variances being computed, in the usual way, under the assumption of no difference in intra-block variances.

It is easily seen that this condition is satisfied for the BIBD and PBIBD of the two-associate type with the special values $\lambda_1 = 1$, $\lambda_2 = 0$ such as the quasi-factorial. It may be of some interest to characterize an experimental design by the presence or absence of this property. Even if this condition is satisfied, there is the additional difficulty of estimating the exact variance of the estimated difference between two given varieties. The exact variance in such a case is a linear compound of the intra-block variances, which, in general, is not a constant multiple of the expected mean square for error except in the case of complete randomized blocks.

4. EXPECTED MEAN SQUARES UNDER A NON-ADDITIVE MODEL FOR A BIBD


In the general case of a non-additive model the following notations and definitions are used.

(i) $x_{ij}^a$ = yield of the a-th variety in the j-th plot of the i-th block.

(ii) $x_{i\cdot}^a$, $x_{\cdot\cdot}^a$, etc., denote averages over the suffixes replaced by dots.

(iii) Variance between plots within blocks

$$b(k-1)\,\sigma_0^2(a) = \sum_i \sum_j (x_{ij}^a - x_{i\cdot}^a)^2, \quad a = 1, \ldots, v$$

(iv) Variance between block-means

$$(b-1)\,\sigma_b^2(a) = \sum_i (x_{i\cdot}^a - x_{\cdot\cdot}^a)^2, \quad a = 1, \ldots, v$$
(v) Interaction variances, plot × treatment, within blocks

$$b(k-1)\, i_0^2(a, c) = \tfrac{1}{2} \sum_i \sum_j \{ (x_{ij}^a - x_{i\cdot}^a) - (x_{ij}^c - x_{i\cdot}^c) \}^2, \quad a, c = 1, \ldots, v$$

(vi) Interaction variances, treatment × block, based on mean values of blocks

$$(b-1)\, i_b^2(a, c) = \tfrac{1}{2} \sum_i \{ (x_{i\cdot}^a - x_{\cdot\cdot}^a) - (x_{i\cdot}^c - x_{\cdot\cdot}^c) \}^2, \quad a, c = 1, \ldots, v$$

(vii) Average variances summed over the varieties

$$\sigma_0^2 = \sum_a \sigma_0^2(a) \div v, \qquad \sigma_b^2 = \sum_a \sigma_b^2(a) \div v,$$

$$i_0^2 = \sum_a \sum_c i_0^2(a, c) \div v(v-1), \qquad i_b^2 = \sum_a \sum_c i_b^2(a, c) \div v(v-1).$$

The expected mean squares in the case of BIBD are given in Table 4.


TABLE 4 EXPECTATIONS OF MEAN SQUARES IN THE ANALYSIS OF A BIBD 
(NON-ADDITIVE MODEL) 


expected mean square 


due to d.f. 5.3. 
eM В singe =k 2 
blocks (eliminating varieties b—1 Sp. ы: 2 g ;2 g در‎ ,9(r—1) 2 
ks ( g ) bv 01) 2 + 51) Е | om ) o; 
varieties (eliminating blocks) 0—1 Sob gi dg que ge 27 ox 
8 p $ Tici b ke 1) ®(та—т)? 
6; о? — lj : 
7 1 p toth 


error 


The expectations in Table 4 under the general set up enable us to examine the nature of the differences in the varieties which the analysis of variance test can detect. Under the null hypothesis, that all varieties have the same yield on each plot of the experimental area, the expectations of the mean square for error and varieties are the same as shown in Table 1. If this null hypothesis is not true then the expected mean square for varieties exceeds that for error only when


¢ Ща 
or ogaao "E 
where $\sigma_\tau^2 = \sum (\tau_a - \bar{\tau})^2 \div (v-1)$, the variance between varietal effects. The relationship (4.1) shows that the analysis of variance test has a reasonable chance of detecting departures from the null hypothesis only when the varietal differences are of a magnitude depending on the block × treatment interactions. For examining the hypothesis $\sigma_\tau^2 = 0$, although $i_0^2$ and $i_b^2$ may not be zero, the analysis of variance test is somewhat conservative as in the case of Randomized block designs (Neyman, 1935); the position is, however, slightly better for a BIBD. The difference between the expected values of mean squares for varieties and error in this case is $[(k-1)/(v-1)]\, i_b^2$, which is small when k is small compared to v. The corresponding difference for randomized blocks is $i_b^2$, apart from some difference in the magnitude of $i_b^2$ itself due to increased block size.


5. EXPECTED MEAN SQUARES IN THE CASE OF A GENERAL 
INCOMPLETE BLOCK DESIGN 


For a BIBD it is seen that under the hypothesis $\sigma_\tau = 0$, the expectations of mean squares for varieties and error agree up to terms containing intra-block variances and plot × treatment interactions. But this may not be true for a general incomplete block design. Let us consider a block containing variety a with (k−1) others denoted by 1, 2, ..., k−1 without loss of generality, and define $\sigma_0^2 V_a$ as the variance of the least square intra-block estimate of the contrast

$$(k-1)\tau_a - \tau_1 - \cdots - \tau_{k-1}$$

under the additive model (2.1) where $\sigma_0^2$ is the intra-block error. For any standard design for which intra-block estimates are provided, the value of $V_a$ can be obtained directly by first estimating the contrast and computing its variance. Or, if $\sigma_0^2 v_{ij}$ stands for the variance of the intra-block estimate of $(\tau_i - \tau_j)$, then

$$V_a = k \sum_i v_{ai} - \sum\sum_{i<j} v_{ij} \qquad \ldots\ (5.1)$$

where in the summations i and j vary over a, 1, ..., k−1.


Let Х,У, = sum of V, for blocks in which the varieties 4 and а occur and 


1 
& — $3 Za F 


k 


" 1 
Lue == 13 (3, ТТЛ. 


The expected value of the sum of squares due to varieties eliminating blocks is 


2 SSE. 4$ = л, к |2 1 
D EH да, o) (орны, its Veg Anlar) a (вз) 


and that due to error is 


Nee qusc. 

E. THp (а, с)... (54) 
The condition for terms involving $\sigma_0^2(a)$ to be equal in the two expectations (5.3) and (5.4) is


{ж ЖАКУ [e] Ba, c) - xx 


£, = =. (independent of a), 
This is true for BIBD, PBIBD of the quasi-factorial type, and linked block designs (LBD), though not for a general incomplete block design. Similarly for the terms involving plot × treatment variances to be equal


oL ИА, 


Sae = b(k— 1)? (5.6) 


which is true for BIBD and PBIBD of the quasi-factorial type and not LBD. The 
condition for terms involving block x treatment variances to be equal is 


д @=Ver—e+)) | 


42 (0—1) (5.7) 


Sac 
It may be examined whether this relation can ever be true and if true, what should be the nature of the design.

The actual expressions for LBD are computed to show the disagreement in the plot × treatment interaction terms. The expected sum of squares for varieties is, apart from the term involving varietal differences,


on sss Ts ( 1 , T—ÀsY;2 ани = 
1 ходи) ZZ ee Hee Vt ay Ха, o) s. (8.8) 
and that for error is 
9 зо I us b—1 _ 7-4 la 1 ух 6—1 тА, Vs 
7 хода) ур r8 E «Eum Jiane) ... (59) 
where $\mu$ is the number of varieties common to any two blocks in a linked block design.

For the coefficients of $i_0^2(a, c)$ in (5.8) and (5.9) to be the same, $\lambda_{ac}$ should be constant, in which case the design is a BIBD. How far the disagreement in the interaction terms can be considered as a drawback of the LB design remains to be examined.

As observed earlier there exist designs for which even the terms involving $\sigma_0^2(a)$ do not agree and it may be of some interest to obtain a classification of the designs with respect to the terms in which the expectations of mean squares for varieties and error agree.

6. RANDOM INDEXING OF VARIETIES 


In complete randomized block, Latin square and BIBD designs, the association scheme for the actual varieties is independent of the correspondence set up between the varieties and the symbols in which a design is represented. Thus, in a randomly chosen Latin square of order 4 using the symbols A, B, C, D it does not matter which of the four varieties is made to correspond with A, which with B and so on. But in a design like the quasi-factorial obtained by taking as blocks the rows and columns of a square with $s^2$ symbols written in the $s^2$ cells, there is the further problem of assigning the varieties to symbols. In this design differences of varieties chosen to correspond with symbols occurring in the same row or column are estimable with a higher precision than those not in the same row or column. So, if certain comparisons are deemed to be more important than others it may be possible to determine a correspondence which allows the estimation of these comparisons with a higher precision. If no such distinction could be made among the various possible comparisons then we may follow the procedure $R_3$ stated below.

$R_3$: Obtain the correspondence between the varieties and the symbols in which a design is represented by randomly permuting the symbols over the varieties.

It is easy to verify that whatever may be the design used, with respect to the reference set generated by the randomization procedures $R_1$, $R_2$ of Section 2 and $R_3$ stated above, the following are true.


(i) The variance of the estimated difference between any two varieties is 
a constant independent of the varieties chosen. 


(ii) The expected mean squares for varieties and error are the same as those obtained for a BIBD (Table 4).

However, it is not suggested that the procedure $R_3$ justifies attaching the same precision to all estimated differences from the results of an experiment when the design is not balanced.


Note: If each observation in an experiment is subject to an additional independent random error (known as technical error) with variance $\sigma_e^2$, the expectations of mean squares in all the Tables of this paper will have the additional term $\sigma_e^2$ with coefficient unity. If $\sigma_e^2$ is large, the estimation of variance of block totals presents some difficulty. Some aspects of this will be considered in a subsequent publication.
SOME REMARKS ON THE MISSING PLOT ANALYSIS
By SUJIT KUMAR MITRA

Indian Statistical Institute, Calcutta

SUMMARY. The analysis of variance of incomplete data from randomised block and latin square experiments is considered and the expected values, under the null hypothesis, of the treatment and error mean squares are obtained. For simplicity, only the case of a single missing observation is considered. The results could be similarly extended to the case of multiple missing observations in a more general situation where the null hypothesis need not be true.


1. INTRODUCTION 


It is now recognised that the justification of the customary F-test in ANOVA for designed experiments has to be sought elsewhere and not in the normality and independence assumptions of the observed random variables (an assumption which is most certainly untrue). Several authors (Neyman (1935); Welch (1937); Pitman (1938); and more recently Kempthorne (1955); Wilk and Kempthorne (1955) among others) investigated the possibility of validating this test as an approximate randomisation test. According to them, the stochastic character of the observed variables is primarily due to the random assignment of the treatments to the experimental units and it is possible (theoretically at least) to write down their joint distribution as soon as the randomisation procedure R is specified. Consider an experiment involving N experimental units where it is desired to compare t treatments. Let $X_u(k)$ be the (hypothetical) yield of the u-th experimental unit when it receives treatment k (u = 1, 2, ..., N; k = 1, 2, ..., t). The treatments are said to be equal in their effects if every plot gives the same yield irrespective of the treatment applied, i.e. if,

$$X_u(1) = X_u(2) = \cdots = X_u(t) \quad \text{for } u = 1, 2, \ldots, N. \qquad \ldots\ (1.1)$$

Usually however we shall not be interested in establishing such a stringent hypothesis (1.1) and are satisfied in detecting deviations from (1.1) only in so far as they imply differences in total yields (over all the N units), i.e., in deviations from

$$\sum_u X_u(1) = \sum_u X_u(2) = \cdots = \sum_u X_u(t). \qquad \ldots\ (1.2)$$

To what extent this is achieved by the ANOVA F-test in some classical designs (like the Randomised Block Design, Latin Square etc.) has been rather thoroughly examined by these authors and for a discussion on this subject the reader is referred to Kempthorne's book (1952). All their researches tend to show that the F-test is unbiased in a certain sense, namely, if (1.1) be true, both the numerator in F (the treatment m.s.) and its denominator (the error m.s.) have the same expected value. Conditions under which this is true under (1.2) also are known. The object of the present paper is to demonstrate (in two simple situations) that this is no longer true with the
ANOVA F-test when the yields on some of the units, in an otherwise well-designed 
experiment, are missing. For computing the expected values we consider independent 
repetitions of R, with the same set of experimental units reporting missing yields each 
time. They are derived making use of certain known results concerning the average 
values of mean squares in the analysis of such experiments with complete data. 


2. RANDOMISED BLOCK DESIGN [ONE PLOT MISSING] 


Here the experimental units (plots) are arranged in $r$ blocks of $t$ plots each
and in each block the $t$ treatments are assigned to the $t$ plots completely at random.
Let $X_{ij}(k)$ be the yield of the $j$-th plot in block $i$, when it receives treatment $k$
and $x_{ik}$ the observed yield of treatment $k$ in block $i$. For simplicity of discussion we
shall assume that the 1st plot in block 1 is missing which under R had received treatment
$m$ (itself a random variable), so that $x_{1m}$ is reported to be missing. In such a case
Yates' method of fitting constants (1933) for estimating the missing yield leads to the
following estimate for $x_{1m}$:

$$\hat{x}_{1m} = \frac{rB'_1 + tT'_m - G'}{(r-1)(t-1)} \tag{2.1}$$

where $B'_1$ = total yield for the $(t-1)$ plots in block 1 for which yields were obtained
$= \sum_{k \ne m} x_{1k}$,

$T'_m$ = total yield for the $(r-1)$ plots of treatment $m$ for which yields were
obtained $= \sum_{i=2}^{r} x_{im}$,

$G'$ = total of all the observed yields,

and the ANOVA table is obtained as follows:

TABLE 2.1. ANOVA FOR A RANDOMISED BLOCK DESIGN
(ONE PLOT MISSING)

sources of variation    d.f.              sum of squares

treatment               t-1               $(T)_m$ (obtained by subtraction)
error                   (r-1)(t-1)-1      $(E)_m$ (obtained as the error s.s. in the completed
                                          data inserting $\hat{x}_{1m}$ for the missing $x_{1m}$)
treatment + error       r(t-1)-1          $(T+E)_m = \sum_{(i,k) \ne (1,m)} x_{ik}^2
                                          - \sum_{i=2}^{r} B_i^2/t - B_1'^2/(t-1)$

where $B_i = \sum_k x_{ik}$ is the total yield of the complete block $i$.

Let $(T)_c$, $(E)_c$ and $(T+E)_c$ denote the sums of squares due to Treatment, Error
and Treatment + Error respectively computed from the complete data, if $x_{1m}$ were
available. Then the following lemma can be easily established.

Lemma 2.1:

$$(E)_c - (E)_m = \frac{(r-1)(t-1)}{rt}\,(x_{1m} - \hat{x}_{1m})^2, \qquad
(T+E)_c - (T+E)_m = \frac{t-1}{t}\Big(x_{1m} - \frac{B'_1}{t-1}\Big)^2. \tag{2.2}$$

Hence

$$\mathcal{E}(E)_m = \mathcal{E}(E)_c - \mathcal{E}\Big[\frac{(r-1)(t-1)}{rt}\,(x_{1m} - \hat{x}_{1m})^2\Big]$$

and

$$\mathcal{E}(T+E)_m = \mathcal{E}(T+E)_c - \mathcal{E}\Big[\frac{t-1}{t}\Big(x_{1m} - \frac{B'_1}{t-1}\Big)^2\Big]. \tag{2.3}$$
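The two identities of Lemma 2.1 can be checked numerically. The following is a minimal sketch, assuming the data are held in an r-by-t NumPy array (blocks by treatments); the function names are illustrative and are not taken from the paper.

```python
import numpy as np

def yates_rbd_estimate(x, block, trt):
    """Yates' estimate (2.1) of the cell x[block, trt], treated as missing."""
    r, t = x.shape
    observed = np.ones((r, t), dtype=bool)
    observed[block, trt] = False
    B = x[block, observed[block]].sum()     # B'_1: rest of the incomplete block
    T = x[observed[:, trt], trt].sum()      # T'_m: rest of the affected treatment
    G = x[observed].sum()                   # G': all observed yields
    return (r * B + t * T - G) / ((r - 1) * (t - 1))

def error_ss(x):
    """Error s.s. of a complete blocks-by-treatments table."""
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()
    return float((resid ** 2).sum())

def lemma_2_1_sides(x, block=0, trt=0):
    """Return (lhs, rhs) pairs for the two identities of Lemma 2.1."""
    r, t = x.shape
    xhat = yates_rbd_estimate(x, block, trt)
    completed = x.copy()
    completed[block, trt] = xhat
    lhs_e = error_ss(x) - error_ss(completed)
    rhs_e = (r - 1) * (t - 1) / (r * t) * (x[block, trt] - xhat) ** 2

    observed = np.ones((r, t), dtype=bool)
    observed[block, trt] = False
    B = x[block, observed[block]].sum()
    full_blocks = np.delete(x, block, axis=0)
    # (T+E)_m: within-block s.s. of the observed yields only
    te_m = (x[observed] ** 2).sum() - (full_blocks.sum(axis=1) ** 2).sum() / t - B ** 2 / (t - 1)
    te_c = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum()
    lhs_te = te_c - te_m
    rhs_te = (t - 1) / t * (x[block, trt] - B / (t - 1)) ** 2
    return (lhs_e, rhs_e), (lhs_te, rhs_te)
```

Calling `lemma_2_1_sides` on any table with at least two blocks and two treatments should return two pairs whose members agree up to rounding error.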


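Let us now assume that (1.1) is true, i.e.

$$X_{ij}(1) = X_{ij}(2) = \cdots = X_{ij}(t) = X_{ij} \quad \text{for all } (i, j)$$

and write

$$e_{ij} = X_{ij} - \bar{X}_{i.} \quad \text{where } \bar{X}_{i.} = \frac{1}{t}\sum_{j} X_{ij}. \tag{2.4}$$

In this case

$$\Big(x_{1m} - \frac{B'_1}{t-1}\Big)^2 = \frac{t^2}{(t-1)^2}\,e_{11}^2
\quad \text{and} \quad (T+E)_c = \sum_i \sum_j e_{ij}^2 = r(t-1)\Delta \ \text{(say)}.$$

Hence

$$(T+E)_m = \sum_i \sum_j e_{ij}^2 - \frac{t}{t-1}\,e_{11}^2 = \mathcal{E}(T+E)_m. \tag{2.5}$$

Also since $\mathcal{E}(T'_m) = \sum_{i=2}^{r} \bar{X}_{i.}$, we have

$$\mathcal{E}(\hat{x}_{1m}) = \frac{B'_1}{t-1} \tag{2.6}$$

and hence

$$\mathcal{E}(\hat{x}_{1m} - x_{1m})^2 = \Big(x_{1m} - \frac{B'_1}{t-1}\Big)^2 + V(\hat{x}_{1m})
= \frac{t^2 e_{11}^2}{(t-1)^2} + \frac{t^2}{(r-1)^2(t-1)^2}\,V(T'_m)
= \frac{t^2 e_{11}^2}{(t-1)^2} + \frac{t}{(r-1)^2(t-1)^2} \sum_{i=2}^{r}\sum_j e_{ij}^2. \tag{2.7}$$

It is also known that

$$\mathcal{E}(E)_c = (r-1)(t-1)\Delta. \tag{2.8}$$

Hence

$$\mathcal{E}(E)_m = (r-1)(t-1)\Delta - \frac{(r-1)\,t}{r(t-1)}\,e_{11}^2 - \frac{\Delta'}{r}$$

and

$$\mathcal{E}(T)_m = (t-1)\Delta - \frac{t}{r(t-1)}\,e_{11}^2 + \frac{\Delta'}{r} \tag{2.9}$$

where

$$\Delta' = \frac{1}{(r-1)(t-1)} \sum_{i=2}^{r}\sum_j e_{ij}^2. \tag{2.10}$$

The expected values of the treatment mean square and the error mean square are
shown in Table 2.2.

TABLE 2.2. EXPECTED VALUES OF MEAN SQUARES
IN TABLE 2.1

sources of variation    d.f.              expected value of mean square

treatment               t-1               $\Delta^* + \frac{1}{r(t-1)}\,(\Delta' - \Delta^*)$
error                   (r-1)(t-1)-1      $\Delta^* + \frac{1}{r[(r-1)(t-1)-1]}\,(\Delta^* - \Delta')$

where $\Delta^* = \frac{1}{r(t-1)-1}\Big[r(t-1)\Delta - \frac{t}{t-1}\,e_{11}^2\Big]$.

Hence the F-test would be unbiased if and only if $\Delta^* = \Delta'$.
Thus if the average error variance in the $(r-1)$ complete blocks is larger than
the error variance of the incomplete block 1, then $\Delta' > \Delta^*$ and the treatment
mean square would have a larger expected value than the error mean square.
Hence we would get more significant values, even under the null hypothesis (1.1),
than what we normally anticipate at the nominal level of testing.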
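The expected mean squares can be checked by enumerating every randomisation of a small experiment. The sketch below is only an illustration under stated assumptions: hypothesis (1.1) holds, plot 1 of block 1 is the missing one, and the closed forms coded in `formula_mean_squares` use $\Delta'$ as the average error variance of the complete blocks and $\Delta^*$ as the within-block variance of the observed plots. All function names are hypothetical.

```python
import itertools
import numpy as np

def enumerated_mean_squares(X):
    """Average treatment and error m.s. over all (t!)**r randomisations.

    X[i, j] is the yield of plot j in block i under (1.1); plot (0, 0) is missing.
    """
    r, t = X.shape
    perms = list(itertools.permutations(range(t)))
    sum_T = sum_E = 0.0
    count = 0
    for assign in itertools.product(perms, repeat=r):
        x = np.empty((r, t))
        for i in range(r):
            x[i, list(assign[i])] = X[i]      # x[i, k]: yield of the plot given treatment k
        m = assign[0][0]                      # treatment that fell on the missing plot
        obs = np.ones((r, t), dtype=bool)
        obs[0, m] = False
        B1 = x[0, obs[0]].sum()
        xhat = (r * B1 + t * x[1:, m].sum() - x[obs].sum()) / ((r - 1) * (t - 1))
        y = x.copy()
        y[0, m] = xhat
        resid = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0, keepdims=True) + y.mean()
        E_m = (resid ** 2).sum()
        TE_m = (x[obs] ** 2).sum() - (x[1:].sum(axis=1) ** 2).sum() / t - B1 ** 2 / (t - 1)
        sum_T += TE_m - E_m                   # treatment s.s. obtained by subtraction
        sum_E += E_m
        count += 1
    return sum_T / count / (t - 1), sum_E / count / ((r - 1) * (t - 1) - 1)

def formula_mean_squares(X):
    """Closed forms for the same two expectations (assumed form, see lead-in)."""
    r, t = X.shape
    e = X - X.mean(axis=1, keepdims=True)
    d_prime = (e[1:] ** 2).sum() / ((r - 1) * (t - 1))
    d_star = ((e ** 2).sum() - t * e[0, 0] ** 2 / (t - 1)) / (r * (t - 1) - 1)
    trt = d_star + (d_prime - d_star) / (r * (t - 1))
    err = d_star + (d_star - d_prime) / (r * ((r - 1) * (t - 1) - 1))
    return trt, err
```

With three blocks of three plots the enumeration covers 216 randomisations, so the comparison runs in well under a second. The two mean squares have equal expectation only when the two variances coincide.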
3. LATIN SQUARE DESIGN [ONE PLOT MISSING]

Here $N = t^2$ and the $t^2$ plots are arranged in $t$ rows and $t$ columns. The $t$
treatments are assigned at random to these plots in such a way that each treatment
occurs once in every row and once in every column. The randomisation procedure
R in a Latin square is discussed in Fisher and Yates Tables (1948). Let $X_{ij}(k)$ be
the yield of plot $(i, j)$ ($i$-th row and $j$-th column) when it receives treatment $k$ and
$x_{ik}$ the observed yield of treatment $k$ in row $i$. We shall assume that plot (1, 1) is
missing which under R had received treatment $m$ so that $x_{1m}$ is reported to be missing.
Here the estimate for the missing yield (Yates, 1936) is computed as

$$\hat{x}_{1m} = \frac{t(R'_1 + C'_1 + T'_m) - 2G'}{(t-1)(t-2)} \tag{3.1}$$

where

$R'_1$ = total yield for the $(t-1)$ plots in row 1 for which yields were obtained,

$C'_1$ = total yield for the $(t-1)$ plots in column 1 for which yields were obtained,

$T'_m$ = total yield for the $(t-1)$ plots of treatment $m$ for which yields were obtained, and

$G'$ = total of all the observed yields.

The ANOVA table is obtained as follows:

TABLE 3.1. ANOVA FOR A LATIN SQUARE (ONE PLOT MISSING)

sources of variation    d.f.              sum of squares

treatment               t-1               $(T)_m$ (obtained by subtraction)
error                   (t-1)(t-2)-1      $(E)_m$ (obtained as the error s.s. in the completed
                                          Latin Square inserting $\hat{x}_{1m}$ for the missing $x_{1m}$)
treatment + error       (t-1)^2-1         $(T+E)_m$ (obtained as the (treatment + error) s.s.
                                          in the completed Latin Square inserting $\tilde{x}_{1m}$ for
                                          the missing $x_{1m}$)

where $\tilde{x}_{1m} = \dfrac{t(R'_1 + C'_1) - G'}{(t-1)^2}$.

Let $(T)_c$, $(E)_c$ and $(T+E)_c$ be the sums of squares due to Treatment, Error and
Treatment + Error respectively computed from the complete Latin square if $x_{1m}$ were
available. Then the following result holds:

Lemma 3.1:

$$(E)_c - (E)_m = \frac{(t-1)(t-2)}{t^2}\,(x_{1m} - \hat{x}_{1m})^2, \qquad
(T+E)_c - (T+E)_m = \frac{(t-1)^2}{t^2}\,(x_{1m} - \tilde{x}_{1m})^2. \tag{3.2}$$

Hence

$$\mathcal{E}(E)_m = \mathcal{E}(E)_c - \mathcal{E}\Big[\frac{(t-1)(t-2)}{t^2}\,(x_{1m} - \hat{x}_{1m})^2\Big],$$

$$\mathcal{E}(T+E)_m = \mathcal{E}(T+E)_c - \mathcal{E}\Big[\frac{(t-1)^2}{t^2}\,(x_{1m} - \tilde{x}_{1m})^2\Big]. \tag{3.3}$$
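Lemma 3.1 can be checked in the same numerical way. The sketch below assumes a cyclic Latin square layout (treatment `(i + j) mod t` on plot `(i, j)`) purely for convenience; the identities do not depend on which square is used. The function names are illustrative.

```python
import numpy as np

def latin_error_ss(y, trt):
    """Error s.s. of a complete t-by-t Latin square; trt[i, j] is the treatment on plot (i, j)."""
    t = y.shape[0]
    row = y.mean(axis=1, keepdims=True)
    col = y.mean(axis=0, keepdims=True)
    trt_mean = np.array([y[trt == k].mean() for k in range(t)])[trt]
    resid = y - row - col - trt_mean + 2 * y.mean()
    return float((resid ** 2).sum())

def row_col_resid_ss(y):
    """(Treatment + error) s.s.: what is left after removing rows and columns."""
    resid = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0, keepdims=True) + y.mean()
    return float((resid ** 2).sum())

def lemma_3_1_sides(y, trt):
    """Return (lhs, rhs) pairs for the two identities, with plot (0, 0) missing."""
    t = y.shape[0]
    obs = np.ones((t, t), dtype=bool)
    obs[0, 0] = False
    R = y[0, 1:].sum()
    C = y[1:, 0].sum()
    T = y[(trt == trt[0, 0]) & obs].sum()
    G = y[obs].sum()
    x_hat = (t * (R + C + T) - 2 * G) / ((t - 1) * (t - 2))   # estimate (3.1)
    x_tilde = (t * (R + C) - G) / (t - 1) ** 2                # rows-and-columns estimate

    y_hat = y.copy()
    y_hat[0, 0] = x_hat
    y_tilde = y.copy()
    y_tilde[0, 0] = x_tilde
    lhs_e = latin_error_ss(y, trt) - latin_error_ss(y_hat, trt)
    rhs_e = (t - 1) * (t - 2) / t ** 2 * (y[0, 0] - x_hat) ** 2
    lhs_te = row_col_resid_ss(y) - row_col_resid_ss(y_tilde)
    rhs_te = (t - 1) ** 2 / t ** 2 * (y[0, 0] - x_tilde) ** 2
    return (lhs_e, rhs_e), (lhs_te, rhs_te)
```

For example, with `t = 4` one may take `trt = np.add.outer(np.arange(4), np.arange(4)) % 4` and any 4-by-4 array of yields.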


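Let us now assume that (1.1) is true, i.e.

$$X_{ij}(1) = X_{ij}(2) = \cdots = X_{ij}(t) = X_{ij} \quad \text{for all } (i, j)$$

and write

$$e_{ij} = X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X}_{..},$$

where

$$\bar{X}_{i.} = \frac{1}{t}\sum_j X_{ij}, \quad \bar{X}_{.j} = \frac{1}{t}\sum_i X_{ij}, \quad
\bar{X}_{..} = \frac{1}{t^2}\sum_i\sum_j X_{ij}. \tag{3.4}$$

As before it can be easily seen that

$$x_{1m} - \tilde{x}_{1m} = \frac{t^2}{(t-1)^2}\,e_{11}$$

and that

$$(T+E)_c = \sum_i\sum_j e_{ij}^2 = (t-1)^2\Delta \ \text{(say)}.$$

Hence

$$(T+E)_m = \sum_i\sum_j e_{ij}^2 - \frac{t^2}{(t-1)^2}\,e_{11}^2 = \mathcal{E}(T+E)_m. \tag{3.5}$$

Also since

$$\mathcal{E}(T'_m) = \frac{1}{t-1}\sum_{i=2}^{t}\sum_{j=2}^{t} X_{ij}, \quad
R'_1 = \sum_{j=2}^{t} X_{1j}, \quad C'_1 = \sum_{i=2}^{t} X_{i1}
\quad \text{and} \quad G' = \sum_{(i,j) \ne (1,1)} X_{ij},$$

we have

$$\mathcal{E}(\hat{x}_{1m}) = \tilde{x}_{1m}, \tag{3.6}$$

and hence

$$\mathcal{E}(\hat{x}_{1m} - x_{1m})^2 = (\tilde{x}_{1m} - x_{1m})^2 + V(\hat{x}_{1m})
= \frac{t^4}{(t-1)^4}\,e_{11}^2 + \frac{t^2}{(t-1)^2(t-2)^2}\,V(T'_m). \tag{3.7}$$

It is known that

$$V(T'_m) = \frac{1}{t-2}\sum_{i=2}^{t}\sum_{j=2}^{t} e_{ij}'^{\,2}$$

where $e'_{ij} = X_{ij} - \bar{X}'_{i.} - \bar{X}'_{.j} + \bar{X}'_{..}$, the means
$\bar{X}'_{i.}$, $\bar{X}'_{.j}$ and $\bar{X}'_{..}$ being taken over the $(t-1)^2$ plots
with $i, j = 2, \ldots, t$.

It is also known that

$$\mathcal{E}(E)_c = (t-1)(t-2)\Delta. \tag{3.8}$$

Hence

$$\mathcal{E}(E)_m = (t-1)(t-2)\Delta - \frac{t^2(t-2)}{(t-1)^3}\,e_{11}^2 - \frac{\Delta'}{t-1} \tag{3.9}$$

and

$$\mathcal{E}(T)_m = (t-1)\Delta - \frac{t^2}{(t-1)^3}\,e_{11}^2 + \frac{\Delta'}{t-1} \tag{3.10}$$

where

$$\Delta' = \frac{1}{(t-2)^2}\sum_{i=2}^{t}\sum_{j=2}^{t} e_{ij}'^{\,2}.$$

The expected values of the corresponding mean squares are shown in Table 3.2.

TABLE 3.2. EXPECTED VALUES OF MEAN SQUARES
IN TABLE 3.1

sources of variation    d.f.              expected value of mean square

treatment               t-1               $\Delta^* + \frac{1}{(t-1)^2}\,(\Delta' - \Delta^*)$
error                   t^2-3t+1          $\Delta^* + \frac{1}{(t-1)(t^2-3t+1)}\,(\Delta^* - \Delta')$

where $\Delta^* = \frac{1}{(t-1)^2-1}\Big[(t-1)^2\Delta - \frac{t^2}{(t-1)^2}\,e_{11}^2\Big]$.

Hence the F-test would be unbiased if and only if $\Delta^* = \Delta'$.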


4. CONCLUSION 


Unless the exact nature of the process, by which an observation is missed
is known, it is difficult to make any further comments on the missing plot analysis.
It may be worthwhile to note in this connection that if one plot is missed at random
from all the available plots, the bias disappears in both the cases considered in this
paper.

The bias which we noticed in this paper possibly affects the test of equality of
treatment effects only in so far as the customary use of the percentage points of the
variance ratio distribution for judging the significance of the computed value of F
will either overestimate or underestimate the level of the randomisation test. When
the number of missing observations is relatively small, this distortion may be only
of minor importance.


THE USE OF LINEAR ALGEBRA IN DERIVING PRIME POWER
FACTORIAL DESIGNS WITH CONFOUNDING AND 
FRACTIONAL REPLICATION 


By NORMAN T. J. BAILEY 
Unit of Biometry, University of Oxford, England 


SUMMARY. This paper discusses the derivation of prime power factorial designs, with
confounding and fractional replication, using only comparatively elementary results in linear
algebra. Some further simplifications and systematization of the standard theory of factorial
designs are given. The procedures recommended are also extremely quick in practice, and are
very easily understood and acquired by students.


1. INTRODUCTION 


A detailed account of the design and analysis of $2^n$, $3^n$ and $2^m 3^n$ factorial designs
was first given by Yates (1937). Then Nair (1938) developed a method of dealing with
$p^n$ designs, where $p$ is a prime or the power of a prime, based on a theory of interchanges
connected with the associated hyper-graeco-latin squares. A more general procedure
for constructing $p^n$ arrangements was subsequently obtained by Bose and Kishen (1940)
using the theory of Galois fields and finite geometries, and a more elaborate account
of this kind of treatment was later given by Bose (1947). Fisher (1942, 1945) made a
considerable advance using methods which, at any rate for $p$ prime, appealed only to the more
elementary properties of groups. These methods were also found suitable by Finney (1945)
for the development of fractionally replicated designs. An alternative method of investigation
has been given by Rao (1946, 1947, 1950) in terms of combinatorial arrangements called
arrays of strength $d$. Kempthorne (1947) then made a further simplification and systematization
of the technique used by Fisher and Finney, and a more detailed account of this theory
has since been given in a standard text-book (Kempthorne, 1952). Additional discussion of
these topics appears in Brownlee, Kelly and Loraine (1948) and Brownlee and Loraine (1948);
and some useful tables have been presented by Rao (1951).

The present paper reconsiders the Fisher-Finney-Kempthorne approach. But
by relating these methods more explicitly to the standard theory of simultaneous linear
equations, it is shown that additional simplification and systematization are possible. We
are thus enabled to make a quick systematic check that the interactions considered for
confounding, or for defining contrasts, do in fact have the properties required; and we can find
automatically generators for the intra-block subgroup and a single treatment combination
from each of the other blocks. The method also gives a simple proof of Fisher's theorem
on minimum block size. Although the technique is most readily applied if $p$ is prime, it is
also very convenient if $p$ is the power of a prime, when we must use the appropriate addition
and multiplication tables for the corresponding Galois field. Not only are these procedures
very quick in practice, but the writer has found that they are understood and acquired by
students more easily than the usual text-book methods, since for the most part only
comparatively elementary results in standard linear algebra are used.


2. STANDARD DERIVATION OF PRIME POWER FACTORIAL DESIGNS
The present discussion assumes an acquaintance with the following basic ideas, for
further details of which Kempthorne (1947, 1952) may be consulted. Suppose we consider
a $p^n$ factorial, where there are $n$ factors, $A, B, C, \ldots$, each with $p$ levels. For the time
being we take $p$ to be a positive prime. Then any interaction component $A^iB^jC^k\ldots$, with
$p-1$ degrees of freedom, is given by comparisons amongst $p$ sets of treatment combinations,
each set being represented by solutions of one of the $p$ congruences

$$i x_1 + j x_2 + k x_3 + \cdots \equiv 0, 1, \ldots, p-1 \pmod{p}. \tag{2.1}$$

The coefficients $i, j, k, \ldots$ must all be restricted to the values $0, 1, 2, \ldots, p-1$, and in
order to obtain a complete and unique enumeration of all the degrees of freedom available we
adopt the convention that the first non-zero index to appear in $A^iB^jC^k\ldots$ must be unity.
It is easily shown that any two congruences of the type shown in (2.1) differing by at least one
coefficient on the left, give rise to two sets of treatment combinations such that any contrast
with one degree of freedom from the first set is orthogonal to any contrast with one degree
of freedom from the second.
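The partition defined by (2.1) is easy to enumerate for small designs. The sketch below is a plain illustration; the function name `interaction_sets` is hypothetical, and treatment combinations are represented as tuples of factor levels.

```python
from itertools import product

def interaction_sets(coeffs, p):
    """Split the p**n treatment combinations into the p sets of congruence (2.1).

    coeffs holds the indices (i, j, k, ...) of the interaction component.
    """
    sets = {c: [] for c in range(p)}
    for x in product(range(p), repeat=len(coeffs)):
        sets[sum(a * v for a, v in zip(coeffs, x)) % p].append(x)
    return sets

# The component AB^2 of a 3^2 factorial: x1 + 2*x2 = 0, 1, 2 (mod 3).
ab2 = interaction_sets((1, 2), 3)
```

Here `ab2[0]` holds the combinations `(0, 0)`, `(1, 1)` and `(2, 2)`, and the other two sets hold three combinations each, so comparisons among the three sets carry the two degrees of freedom of the component.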


Suppose now we want a design in which $m$ interaction components are confounded
with blocks. Then there are $p^m$ blocks, the composition of which is given by the solutions
of $p^m$ sets of congruences. Each set of congruences contains $m$ members. The left hand sides
in any set are all different and correspond to the $m$ interaction components chosen; the right
hand sides are a selection, with repetitions allowed, from the numbers $0, 1, \ldots, p-1$.

Fractional replication is dealt with in a similar fashion by adding to the simultaneous
congruences already required for confounding further congruences specifying the defining
contrasts. The main difference in these latter congruences is that to each left hand side there
corresponds only one number on the right, whose value depends on which fraction of the
replicate is being used.


3. SOLUTION OF A SYSTEM OF LINEAR EQUATIONS 


We shall also require the following results for the solution of a system of simultaneous
linear equations. An extended discussion of the basic theory may be found in
Stoll (1952). Suppose we have a consistent system of $m$ non-homogeneous linear equations
in $n$ variables $x_j$, $j = 1, 2, \ldots, n$, viz.

$$\sum_{j=1}^{n} a_{ij} x_j = y_i, \quad i = 1, \ldots, m. \tag{3.1}$$

Then it is a straightforward matter to reduce (3.1) to the echelon (or canonical) form

$$\sum_{j=1}^{k_i} b_{ij} x_j = z_i, \quad i = 1, \ldots, r, \tag{3.2}$$

where

$$1 \le k_1 < k_2 < \cdots < k_r \le n, \tag{3.3}$$

using only the usual elementary operations. The distinguishing characteristics of such
a system are (i) the last non-zero coefficient in each equation is unity, (ii) the lengths of
the $r$ linear forms are all different and follow the order of magnitude shown in (3.3), and
(iii) $x_{k_i}$ appears with a non-zero coefficient only in the $i$-th equation.

From (3.2) we obtain immediately the solved form or general solution

$$x_{k_i} = \sum_{j=1}^{n-r} c_{ij} v_j + z_i, \quad i = 1, \ldots, r; \qquad
x_{k_i} = v_{i-r}, \quad i = r+1, \ldots, n, \tag{3.4}$$

where the $v_j$ are arbitrary numbers. We are in fact writing the $x_{k_i}$, $i = 1, \ldots, r$, in
terms of the remaining $n-r$ unknowns to which arbitrary values can be assigned.

We now consider the homogeneous system of equations

$$\sum_{j=1}^{n} a_{ij} x_j = 0, \quad i = 1, \ldots, m, \tag{3.5}$$

given by putting all the $y_i$ equal to zero in (3.1). The equations corresponding to (3.2) and
(3.4) now have all the $z_i$ zero. Consider the set of solutions given by

$$\begin{aligned}
X_1 &= (c_{11}, c_{21}, \ldots, c_{r1}, 1, 0, 0, \ldots, 0), \\
X_2 &= (c_{12}, c_{22}, \ldots, c_{r2}, 0, 1, 0, \ldots, 0), \\
&\ \ \vdots \\
X_{n-r} &= (c_{1,n-r}, c_{2,n-r}, \ldots, c_{r,n-r}, 0, 0, 0, \ldots, 1),
\end{aligned} \tag{3.6}$$

where we now take $x_{k_1}, x_{k_2}, \ldots, x_{k_n}$ for the order of the $x_j$. It is easily shown
that (3.6) is a basis of the general solution of (3.5), in the sense that the latter is given by the
set, $X$, of all linear combinations like $\sum_{j=1}^{n-r} u_j X_j$.

When $r < n$ it is often useful to employ the theorem that the general solution of
(3.1) can be written as $X + X_0$, where $X$ is the general solution of (3.5), as above, and $X_0$
is one fixed solution of (3.1) and we regard $X$ and $X_0$ as linear forms in the $x_j$.

The above results hold not only for systems of equations for which the coefficients
$a_{ij}$ and unknowns $x_j$ are all real numbers, but also for systems with coefficients, and hence
solutions, in any field. This fact permits us to make the application to factorial designs
described in the next section.
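The reduction to echelon form and the basis (3.6) can be sketched for the field of integers modulo a prime. This is one possible arrangement of the elementary operations, not necessarily the order a hand computation would follow: columns are swept from right to left so that each pivot is the last non-zero coefficient of its row. The function names are illustrative, and the three-argument `pow` with exponent -1 needs Python 3.8 or later.

```python
def echelon_mod_p(rows, p):
    """Reduce the coefficient rows to the echelon form of (3.2) modulo a prime p.

    Returns (echelon_rows, pivot_columns), both ordered by pivot column.
    """
    a = [[v % p for v in row] for row in rows]
    n = len(a[0])
    pivots = {}                      # pivot column -> row index
    used = set()
    for col in range(n - 1, -1, -1):
        r = next((i for i in range(len(a)) if i not in used and a[i][col]), None)
        if r is None:
            continue
        inv = pow(a[r][col], -1, p)  # make the pivot coefficient unity
        a[r] = [(v * inv) % p for v in a[r]]
        for i in range(len(a)):
            if i != r and a[i][col]:
                f = a[i][col]
                a[i] = [(vi - f * vr) % p for vi, vr in zip(a[i], a[r])]
        used.add(r)
        pivots[col] = r
    cols = sorted(pivots)
    return [a[pivots[c]] for c in cols], cols

def homogeneous_basis(ech, pivot_cols, p):
    """Basis (3.6): set one free variable to 1, the others to 0, and solve for the pivots."""
    n = len(ech[0])
    basis = []
    for f in (c for c in range(n) if c not in pivot_cols):
        x = [0] * n
        x[f] = 1
        for row, k in zip(ech, pivot_cols):
            x[k] = (-row[f]) % p
        basis.append(x)
    return basis

ech, cols = echelon_mod_p([[1, 1, 1, 0], [0, 1, 2, 1]], 3)
basis = homogeneous_basis(ech, cols, 3)
```

For this small system modulo 3 the pivots fall in the last two columns, and each basis vector satisfies both original equations.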
4. APPLICATION TO PRIME POWER FACTORIALS OF TYPE $p^n$

In applying the results of the last section to the type of situation envisaged in section
2, the first remark to be made is that the congruences (2.1) can be replaced by actual
equations

$$i x_1 + j x_2 + k x_3 + \cdots = 0, 1, \ldots, p-1, \tag{4.1}$$

provided that we use an algebra of remainders modulo $p$. If $p$ is prime the system of
integers modulo $p$ constitute a finite field to which all the results of Section 3 may be applied.

Suppose now that we have a $p^n$ factorial and wish to confound $m$ interactions with
blocks. The required treatment combinations in the $p^m$ blocks are given by the solutions
of the $p^m$ possible systems of equations. Each system is of the type shown in (3.1) with $m$
members, and the $y_i$ are a selection with repetitions from the integers $0, 1, \ldots, p-1$.


Checking the interactions to be confounded. The first step in deriving a design is to
check that the interactions to be confounded are in fact linearly independent, and do not
automatically entail the confounding of any interactions of order less than some specified
number (often three). This is done by reducing the homogeneous system (3.5) to the
corresponding echelon system. If $m$ equations remain then both these and the original system
are necessarily independent. All other interactions involved in the confounding must be
given by all linear combinations of the rows of coefficients in the echelon system. It can
usually be seen at a glance whether the required condition holds. If, for example, we wish
no two-factor interaction to be confounded then we first inspect the echelon system to ensure
that each equation has more than two non-zero coefficients. Secondly, we need consider
only linear combinations of pairs of equations, since combinations of three or more must
involve at least three non-zero coefficients. For each pair combined the last non-zero
coefficients in each equation provide two non-zero elements, and we have only to ensure that at
least one other always survives. This entails examining far fewer elements than the usual
procedure.

It should be mentioned that the echelon system technique was employed by Bose
(1947) in his more advanced treatment of certain factorial designs. Bose uses the term
'canonical form' instead of echelon system.


Specifying the intrablock subgroup and other blocks. Having checked the interactions
to be confounded we next determine the composition of the various blocks. The set of
treatment combinations appearing in the block containing (00...0) is given by the solution of the
homogeneous system (3.5). This can be done immediately by using the echelon system
as described in the last section. It is often convenient to specify this block, especially if
it is fairly large, by means of a basis (with $n-m$ members), all linear combinations of which
give the remaining treatment combinations. Since we are using integers modulo $p$ the
complete solution constitutes a group, the intrablock subgroup, for which the basis is a
set of generators.

The composition of any other block, obtained by solving one of the appropriate systems
of non-homogeneous equations, is then given by the addition of one fixed solution of the latter
to the general solution of the homogeneous system. This is equivalent to the usual rule for
deriving blocks other than the intrablock subgroup. It is easily seen that the quickest way
to write down one treatment combination for each of these blocks when confounding in a
complete replicate is to start with the control (00 ... 0) appearing in the intrablock subgroup
and then let the $m$ variables corresponding to the $x_{k_j}$, $j = 1, 2, \ldots, m$, in the echelon system
run through all the values $0, 1, \ldots, p-1$.

With fractional replication on the other hand we must restrict this set of treatment
combinations to those satisfying the equations corresponding to the fraction
adopted. With large fractions this can be done by inspection, but in general it
may be more convenient to use a more systematic method as follows. If, when manipulating
the rows of coefficients in the original simultaneous equations to obtain the echelon system
we perform a similar set of operations on the unit matrix with the same number of rows,
a matrix is obtained which when used as a pre-multiplier effects the transformation directly.
When this matrix pre-multiplies the matrix given by all admissible sets of quantities
appearing on the right of the original equations we obtain a new array whose columns give
immediately a possible set of alternative values for the $x_{k_j}$ when the other variables are all zero.
When using a complete replicate we have of course the original $p^m$ combinations all over
again, and so do not need to go through this procedure at all.

The illustrative example below should make clear how easily the method can be applied
in practice.
Illustrative example. As an illustration of the foregoing let us consider the example
given by Kempthorne (1952, p. 426) of a 1/9 replicate of a $3^7$ factorial confounded in 9 blocks
each with 27 treatment combinations. The defining contrasts suggested for the fractional
replication were $ABCD^2E$ and $CD^2E^2F^2G^2$, while $AB^2F^2G$ and $BCDF$ were in addition to be
confounded between blocks. To check these interactions and obtain the intrablock subgroup
we consider the systems of equations whose left-hand sides have coefficients given by

    1 1 1 2 1 0 0
    0 0 1 2 2 2 2
    1 2 0 0 0 2 1          (4.2)
    0 1 1 1 0 1 0

A suitable echelon system is quickly found to be

    2 2 1 0 0 0 0
    2 2 0 2 1 0 0
    1 2 0 1 0 1 0          (4.3)
    2 1 0 1 0 0 1

which is in fact derived from (4.2) by the sequence of linear operations defined by the
pre-multiplying matrix

    1 1 1 2
    0 2 2 1
    2 2 2 2                (4.4)
    2 2 0 2

(It is easily checked that (4.3) is the product of (4.4) and (4.2) in that order.)

The four interactions originally suggested are clearly independent since we still have
four rows. Examination of the reduced array consisting of the first, second and fourth
columns only shows that in no case can a linear combination of two
rows involve less than one non-zero coefficient. The full array thus entails the confounding
of no interaction with less than three factors. We accordingly proceed with specifying the
defining elements of the design.

A suitable basis for the intrablock subgroup is obtained by adopting in turn the values
(100), (010) and (001) for the variables corresponding to the factors $A$, $B$ and $D$, and then
solving at sight the homogeneous system of equations whose left-hand sides are represented
by (4.3). Thus we first write down the bold-faced numerals in (4.5) below (those in the
first, second and fourth places) and then fill in the remainder using in turn each of the
equations just mentioned. The three generators of the intrablock subgroup are therefore

    (1 0 1 0 1 2 1),
    (0 1 1 0 1 1 2),       (4.5)
    (0 0 0 1 1 2 2).


If we are to use the fraction containing the 'control' treatment (000 ... 0) then there
are nine systems of equations in all, one of which is homogeneous, whose left-hand sides are
given by (4.2) and whose right-hand sides may be taken as the columns of the matrix

    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 1 1 1 2 2 2      (4.6)
    0 1 2 0 1 2 0 1 2

Pre-multiplication by (4.4) gives

    0 2 1 1 0 2 2 1 0
    0 1 2 2 0 1 1 2 0
    0 2 1 2 1 0 1 0 2      (4.7)
    0 2 1 0 2 1 0 2 1

so that nine suitable treatment combinations, one from each block, are

    (0000000),
    (0020122),
    (0010211),
    (0010220),
    (0000012),             (4.8)
    (0020101),
    (0020110),
    (0010202),
    (0000021).

The whole design can now be written down from the basic defining elements provided by
(4.5) and (4.8).
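The arithmetic of this example can be re-run modulo 3. The sketch below is a check under assumptions: the rows of `A42` are the exponent vectors of the four interactions read as $ABCD^2E$, $CD^2E^2F^2G^2$, $AB^2F^2G$ and $BCDF$, and `M44` is the pre-multiplier (4.4); the variable names are illustrative only.

```python
import numpy as np

P = 3
# Exponents of A..G in the four interactions (assumed reading of the example).
A42 = np.array([[1, 1, 1, 2, 1, 0, 0],
                [0, 0, 1, 2, 2, 2, 2],
                [1, 2, 0, 0, 0, 2, 1],
                [0, 1, 1, 1, 0, 1, 0]])
M44 = np.array([[1, 1, 1, 2],
                [0, 2, 2, 1],
                [2, 2, 2, 2],
                [2, 2, 0, 2]])

A43 = (M44 @ A42) % P                    # echelon system, pivots on C, E, F, G
generators = np.array([[1, 0, 1, 0, 1, 2, 1],
                       [0, 1, 1, 0, 1, 1, 2],
                       [0, 0, 0, 1, 1, 2, 2]])
gens_ok = bool(((A42 @ generators.T) % P == 0).all())

# Right-hand sides for the nine blocks of the fraction containing the control.
rhs = np.array([[0] * 9,
                [0] * 9,
                [0, 0, 0, 1, 1, 1, 2, 2, 2],
                [0, 1, 2, 0, 1, 2, 0, 1, 2]])
A47 = (M44 @ rhs) % P

# One treatment combination per block: pivot variables from A47, the rest zero.
pivot_cols = [2, 4, 5, 6]
reps = np.zeros((9, 7), dtype=int)
reps[:, pivot_cols] = A47.T
reps_ok = bool(((A42 @ reps.T) % P == rhs).all())
```

Both `gens_ok` and `reps_ok` come out true, which confirms that the generators lie in the intrablock subgroup and that each row of `reps` solves the system for its own block.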
5. FISHER'S THEOREM ON MINIMUM BLOCK SIZE


The theorem proved by Fisher (1942, 1945) on minimum block size is well-known.
This states that a $p^n$ factorial can be arranged so as to confound $m$ independent interactions
between $p^m$ blocks, each of $p^{n-m}$ treatment combinations, with no two-factor interaction
confounded, provided

$$n \le \frac{p^{n-m} - 1}{p - 1}. \tag{5.1}$$

When $p = 2$ this reduces to the requirement that the number of treatment combinations
in each block must be greater than the number of factors.

Consider the coefficients in the corresponding echelon system. There are $m$ columns
containing a unit element which is the last in its row. This leaves an array with $n-m$ columns
and $m$ rows. In order that no two-factor interaction may be confounded it is sufficient if
each of these latter rows contains at least two non-zero elements, and if all rows are different
subject to no one being a multiple of any other. The total number of ways of allotting the
numbers $0, 1, 2, \ldots, p-1$ to $n-m$ places is $p^{n-m}$, but we must exclude the single case having
all zeros, and the $(n-m)(p-1)$ cases with just one non-zero element. The remaining
arrangements all contain at least two non-zero elements, but to any given arrangement there are
$p-2$ others which are merely multiples. The total number of arrangements available for
allocation to the $m$ rows, such that the required conditions are satisfied, is therefore

$$\frac{p^{n-m} - 1 - (n-m)(p-1)}{p-1}. \tag{5.2}$$

The condition that this expression must be not less than $m$ yields the required result (5.1)
after rearrangement.
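The counting argument can be checked directly. The sketch below evaluates expression (5.2) and the condition $n(p-1) \le p^{n-m} - 1$, which is the rearranged form of the requirement that (5.2) be not less than $m$; the function names are illustrative.

```python
def admissible_rows(p, n, m):
    """Expression (5.2): rows with at least two non-zero entries, counted up to multiples."""
    k = n - m
    return (p ** k - 1 - k * (p - 1)) // (p - 1)

def fisher_condition(p, n, m):
    """True when n factors fit in blocks of p**(n - m) with no two-factor interaction confounded."""
    return n * (p - 1) <= p ** (n - m) - 1

# The two statements should agree for every small case.
agree = all(
    (admissible_rows(p, n, m) >= m) == fisher_condition(p, n, m)
    for p in (2, 3, 5, 7)
    for n in range(2, 10)
    for m in range(1, n)
)
```

For instance, with $p = 2$, $n = 7$ and $m = 4$ the blocks contain 8 combinations, which exceeds the 7 factors, and `admissible_rows(2, 7, 4)` gives exactly the 4 rows needed.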


6. DERIVATION OF A $p^n$ DESIGN FROM A $2^n$ DESIGN

Another point worth commenting on is the way in which a $p^n$ design may be derived
from a $2^n$. Fisher (1942) remarked that "the solutions available when $p = 2$ will be available
in general, with the assurance that no interaction will involve fewer factors than the
corresponding interaction when $p = 2$." To see how this result arises in connexion with the present
methods, consider the echelon system for the $m$ homogeneous equations whose solutions
give the intrablock subgroup in a $2^n$ factorial, and the corresponding set of $n-m$ generators.
It is easily seen that if we change to an algebra modulo $p$ $(> 2)$ then the same set of $n-m$
generators will satisfy a new echelon system of $m$ equations which is derived from the first
system by multiplying all coefficients, except the last in each row, by $p-1$. We thus have
the essential ingredients of a $p^n$ design confounding $m$ interactions, each involving no fewer
factors than the original $2^n$ arrangement, in blocks of $p^{n-m}$.
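A small check of this lifting rule is sketched below. The starting system, confounding $ABC$ and $ADE$ in a $2^5$ factorial, is chosen here only as an illustration and does not come from the paper; the function names are likewise hypothetical.

```python
def lift_row(row, pivot, p):
    """Multiply every coefficient except the pivot (last, unit) one by p - 1, modulo p."""
    return [v if j == pivot else (v * (p - 1)) % p for j, v in enumerate(row)]

def satisfies(row, x, p):
    """True when the treatment combination x solves the homogeneous equation given by row."""
    return sum(a * b for a, b in zip(row, x)) % p == 0

# Echelon system modulo 2 for ABC and ADE, with pivots on C and E.
echelon_mod2 = [([1, 1, 1, 0, 0], 2), ([1, 0, 0, 1, 1], 4)]
# Generators of the intrablock subgroup modulo 2 (free variables A, B, D).
generators = [[1, 0, 1, 0, 1], [0, 1, 1, 0, 0], [0, 0, 0, 1, 1]]

lifted_ok = all(
    satisfies(lift_row(row, pivot, p), g, p)
    for p in (3, 5, 7)
    for row, pivot in echelon_mod2
    for g in generators
)
```

The same 0/1 generators solve the lifted equations for each odd prime tried, because each equation reduces to $(p-1) + 1 \equiv 0$ or $0 \equiv 0$ modulo $p$.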
7. EXTENSION TO DESIGNS WITH FACTORS HAVING $p^s$ LEVELS


So far we have been considering $p^n$ designs in which each factor has a number of
levels, $p$, which is prime. When the numbers of levels for the several factors are powers,
not all the same, of the same prime, e.g., $2^4 \times 4$, the introduction of pseudofactors is usually
the most convenient method of treatment as it reduces the design to a standard $p^n$ form,
in this case $2^6$. This has however the disadvantage, especially with confounding, that some
components of main effects of original factors appear formally as higher order interactions
of pseudofactors. If therefore the factors have levels which are all the same power of a prime
it may be more convenient to proceed directly.

When dealing with a $p^n$ factorial we made use of the fact that integers modulo $p$
constitute a field of $p$ elements. If the number of levels is the power of a prime, $p^s$, we can
still use the same basic technique outlined above since it is always possible to construct a
field of $p^s$ elements. The theory of this involves the more advanced properties of Galois
fields and has been admirably described in the context of experimental design by Bose (1938).
The main complication is that the elements of the field are no longer real numbers, but can
be adequately represented by them provided we adopt the appropriate rules for addition
and multiplication.

On the whole the remarks of section 2 still apply with the obvious modifications.
Thus any interaction component $A^iB^jC^k\ldots$ with $p^s-1$ degrees of freedom is given by
comparisons amongst $p^s$ sets of treatment combinations obtained as solutions of equations
like

$$i x_1 + j x_2 + k x_3 + \cdots = 0, 1, 2, \ldots, p^s - 1, \tag{7.1}$$

where now all coefficients and variables are elements of the relevant Galois field, and must
obey the appropriate laws of composition. Designs with confounding or fractional replication
can be derived from the solution of simultaneous equations as before. The only further
material we require is therefore a set of addition and multiplication tables for the small
number of Galois fields likely to be required in practice.

Illustrative example. A simple illustration should make the simplicity of the above
procedure clear. Consider the problem of laying out a $4^4$ factorial in 16 blocks of 16 units
each. The addition and multiplication tables for a Galois field of $2^2$ elements, 0, 1, 2 and 3,
can be taken as

    + | 0 1 2 3        x | 0 1 2 3
    --+--------        --+--------
    0 | 0 1 2 3        0 | 0 0 0 0
    1 | 1 0 3 2        1 | 0 1 2 3        (7.2)
    2 | 2 3 0 1        2 | 0 2 3 1
    3 | 3 2 1 0        3 | 0 3 1 2

Suppose we decide to try confounding the interactions $ABC$ and $BC^2D$. The coefficients in
the homogeneous equations whose solution yields the intrablock subgroup are given by

    1 1 1 0
    0 1 2 1                (7.3)

Inspection of the first and fourth columns shows that (7.3) can already be taken as an echelon
form. We immediately obtain the generators

    (1 1 0 1),
    (1 0 1 2),             (7.4)

where, as before, the bold-faced numerals (here those in the second and third places) are
written down first, and the remaining ones are calculated by substitution in the echelon
system. Taking all linear combinations of these two generators then gives 13 further
treatments, which with (0000) make up the intrablock subgroup. One combination from
each of the other 15 blocks is obtained by 'ringing the changes' on the numerals in (7.4)
which are not bold-faced.

We can also introduce fractional replication into factorial designs of the present
type by using the device already described in the illustration at the end of Section 4.
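The $4^4$ example can be reproduced with a few lines of field arithmetic. The sketch below assumes the usual representation of the Galois field of four elements by the labels 0, 1, 2, 3, in which addition is bitwise exclusive-or and multiplication follows the table coded in `GF4_MUL`; the generator values are those obtained by solving the two equations for $ABC$ and $BC^2D$ with $(B, C)$ set to $(1, 0)$ and $(0, 1)$. All names are illustrative.

```python
GF4_MUL = [[0, 0, 0, 0],
           [0, 1, 2, 3],
           [0, 2, 3, 1],
           [0, 3, 1, 2]]

def gf4_add(a, b):
    return a ^ b                 # characteristic 2: addition is exclusive-or

def gf4_mul(a, b):
    return GF4_MUL[a][b]

def gf4_dot(row, x):
    acc = 0
    for a, v in zip(row, x):
        acc = gf4_add(acc, gf4_mul(a, v))
    return acc

equations = [(1, 1, 1, 0),       # ABC
             (0, 1, 2, 1)]       # BC^2D
g1 = (1, 1, 0, 1)
g2 = (1, 0, 1, 2)

# All 16 linear combinations a*g1 + b*g2 with a, b in the field.
subgroup = {
    tuple(gf4_add(gf4_mul(a, u), gf4_mul(b, v)) for u, v in zip(g1, g2))
    for a in range(4)
    for b in range(4)
}
all_solve = all(gf4_dot(eq, x) == 0 for eq in equations for x in subgroup)
```

The set `subgroup` has 16 members including (0, 0, 0, 0), that is, the two generators plus 13 further treatments and the control, and every member satisfies both equations.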
PROPERTIES OF THE INVARIANT $I_m$ (m-odd) FOR DISTRIBUTIONS
ADMITTING SUFFICIENT STATISTICS


By B. RAJA RAO 


University of Poona, India 


SUMMARY. In this paper, the exact form of $I_m$ (m-odd) is obtained in general terms for
distributions admitting sufficient statistics. The case when $m$ is odd presents some difficulties
which are got over by using a certain technique. From this, the exact form of $I_m$ (m-odd) is
obtained for the normal, the type III, the Poisson and the Binomial distributions, both when the
parameters vary simultaneously and separately. Finally, it is shown that $I_m$, whether $m$ is odd
or even, for the normal distribution leads to the usual prior probability forms for the parameters.
Extension to multivariate distributions is immediate. Thus our results generalize those of
Huzurbazar (1955).


1. INTRODUCTION 


The statistical significance of the invariants is described by Jeffreys (1946, 1948).
The invariant $I_m$ is used by him in stating the prior probability of parameters in estimation
problems and significance tests. Huzurbazar (1955) has obtained the exact form of $I_m$
(m-even) for the distributions admitting sufficient statistics explicitly in terms of the
parameters and has deduced the exact form of $I_m$ for the type III distribution. He has also
obtained (Huzurbazar, 1955) the exact form of $I_m$ for the normal distribution $N(\lambda, \sigma)$ when
the parameters $\lambda$ and $\sigma$ vary separately. He has shown that $I_m$ leads to the appropriate
prior probability forms given by Jeffreys for the parameters.


2. DISTRIBUTIONS ADMITTING SUFFICIENT STATISTICS


We shall take the most general form of distributions admitting sufficient statistics 


as given by Koopman (1936) ; 


де, аў = exp (8 9909-409-868) 2 Qu) 


denotes the set of p parameters (01, 05, +++) р). : Following Huzurbazar 


where a; for brevity 
Ге, ads = 1, 


(1955), we have; since for all ду, 


| exp [$ ење] = exp {—B(aj)}. xs (22) 


f the p parameters a;. We can express the o; 


pendent functions 0 
) can be expressed in terms of the us as 


Then Bla; 
written as 


Now the uz(&;j) are P inde: ' 
inversely as functions of the t S- 
Вӯ(о;) = bluz), 8° that (2.2) may be 


[oo (B eps A) a Ca oem 
= cC 
The invariant $I_m$ is defined as $I_m = \int |f'^{1/m} - f^{1/m}|^m\,dx$, so that when $m$ is even, it is nothing more than straightforward expansion and integration. But when $m$ is odd, the absolute value presents some difficulties. For that we use the following technique. We split up the range of integration into two parts $R$ and $R^*$ throughout which the integrands are positive. For brevity, write $f(x, \alpha_i) = f$, $f(x, \alpha_i') = f'$, $u_k(\alpha_i') = u_k'$, etc.


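Now $f' > f$ if, and only if,

$$\sum_{k=1}^{p} (u_k' - u_k)\,v_k(x) > B - B'. \qquad \ldots\ (2.4)$$

Let $R$ be the set of all points $x$ in the interval $(-\infty, \infty)$ for which the inequality (2.4) holds and let $R^*$ be the set of the remaining points, so that $f' \le f$ for all $x$ belonging to $R^*$. Then, since $m$ is odd,

$$I_m = \int_R \big(f'^{1/m} - f^{1/m}\big)^m dx + \int_{R^*} \big(f^{1/m} - f'^{1/m}\big)^m dx = \int \big(f'^{1/m} - f^{1/m}\big)^m dx - 2\int_{R^*} \big(f'^{1/m} - f^{1/m}\big)^m dx. \qquad \ldots\ (2.5)$$

Following Huzurbazar, we shall get,

$$I_m = \sum_{\gamma=0}^{m} (-1)^{\gamma}\binom{m}{\gamma} \exp\Big\{\frac{(m-\gamma)B' + \gamma B}{m} - b\Big(\frac{(m-\gamma)u_k' + \gamma u_k}{m}\Big)\Big\} - 2\sum_{\gamma=0}^{m} (-1)^{\gamma}\binom{m}{\gamma} \exp\Big\{\frac{(m-\gamma)B' + \gamma B}{m} - b^*\Big(\frac{(m-\gamma)u_k' + \gamma u_k}{m}\Big)\Big\}, \qquad \ldots\ (2.6)$$

where the function $b^*(u_k)$ is defined by [the relation analogous to (2.3)]

$$\int_{R^*} \exp\Big\{\sum_{k=1}^{p} u_k\,v_k(x) + A(x)\Big\}\,dx = \exp\{-b^*(u_k)\}. \qquad \ldots\ (2.7)$$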


It may be pointed out here that the exact form of $I_m$ (m-odd) cannot be expressed explicitly in terms of the parameters due to the presence of the function $b^*$ in (2.6). But in the case of many particular distributions, $b^*$ can actually be evaluated as an explicit function of the parameters as is shown in the examples discussed below. It is interesting to note that the first series in (2.6) is just the exact form of $I_m$ (m-even) obtained by Huzurbazar (1955) [his equation (23)].
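Exact form of $I_m$ (m-odd) for the normal distribution. We shall now deduce the exact form of $I_m$ (m-odd) for the normal distribution $N(\lambda, \sigma)$ from (2.6) when the parameters $\lambda$ and $\sigma$ vary simultaneously. We have

$$f(x, \lambda, \sigma) = \exp\Big\{-\frac{x^2}{2\sigma^2} + \frac{\lambda x}{\sigma^2} - \frac{\lambda^2}{2\sigma^2} - \log\big(\sigma\sqrt{2\pi}\big)\Big\} \qquad (-\infty < x < \infty). \qquad \ldots\ (2.8)$$

Here $u_1 = \lambda/\sigma^2$, $u_2 = 1/\sigma^2$ and

$$B(\lambda, \sigma) = -\Big(\frac{\lambda^2}{2\sigma^2} + \log\big(\sigma\sqrt{2\pi}\big)\Big) = -\Big(\frac{u_1^2}{2u_2} + \log\sqrt{2\pi/u_2}\Big) = b(u_1, u_2).$$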


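It is easy to see that $R^*$ is given by the set of all points $x$ for which

$$(\sigma^2 - \sigma'^2)x^2 - 2(\lambda'\sigma^2 - \lambda\sigma'^2)x + (\lambda'^2\sigma^2 - \lambda^2\sigma'^2) - \rho^2(\sigma^2 - \sigma'^2) \ge 0, \qquad \ldots\ (2.9)$$

where

$$\rho^2 = \frac{2\sigma^2\sigma'^2 \log(\sigma/\sigma')}{\sigma^2 - \sigma'^2}. \qquad \ldots\ (2.10)$$

Two cases arise according as $\sigma \gtrless \sigma'$.

Case (1): Let $\sigma > \sigma'$. Then $R^*$ is composed of two intervals $(-\infty, \mu_1)$ and $(\mu_2, \infty)$ where $\mu_1$ and $\mu_2$ $(\mu_1 < \mu_2)$ are the roots of the quadratic

$$(\sigma^2 - \sigma'^2)x^2 - 2(\lambda'\sigma^2 - \lambda\sigma'^2)x + (\lambda'^2\sigma^2 - \lambda^2\sigma'^2) - \rho^2(\sigma^2 - \sigma'^2) = 0, \qquad \ldots\ (2.11)$$

given by

$$\frac{\lambda'\sigma^2 - \lambda\sigma'^2 \pm K}{\sigma^2 - \sigma'^2}, \quad \text{where } K = +\sqrt{\sigma^2\sigma'^2(\lambda' - \lambda)^2 + \rho^2(\sigma^2 - \sigma'^2)^2}. \qquad \ldots\ (2.12)$$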


Substituting in (2.6) we find after some reduction 


te 
a, (97) son {enim =P} E [ ля "аа... "m 


— gmot 


"VEM oe a (214) 
where by = (т—9)0*--70 " 
X—X29,-K 
(3F and &= a i ... (215) 


= 0—0, 
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In terms of the error function, we can write (2.13) as 


m ern 
Im a А (Ё. == 
E 


„бу. exp [=e a \| ext (8; )—erf (5 )—1]| 


2т?о?о"' р 2 


(2.16) 
Case (2): Let $\sigma < \sigma'$. Then $R^*$ is the interval $(\mu_1, \mu_2)$ and in this case the exact form of $I_m$ (m-odd) would be given by just the negative of (2.13), so that in the general case, the absolute value of (2.13) gives the exact form of $I_m$ (m-odd) when both the parameters $\lambda$ and $\sigma$ vary simultaneously.


Putting m = 1 in the above, we have immediately, since à = 0’ and 8, =o 


І, = 


erf ( RUP А ) E NM (ee, 3 .) dag | (Ajo K ры 


(0—0?) 3/2. с 


zs (A'—A)e-- К 
= еек, ). (2.17) 
Exact form of $I_m$ (m-odd) when the parameters $\lambda$ and $\sigma$ vary separately. Let us now deduce the exact form of $I_m$ (m-odd) when the parameters $\lambda$ and $\sigma$ vary separately. First suppose that $\lambda$ is kept fixed and that only $\sigma$ varies. Then $R^*$ is given by the set of all points $x$ for which $|x - \lambda| \ge \rho$ when $\sigma > \sigma'$ and $|x - \lambda| \le \rho$ when $\sigma < \sigma'$. Also, since from (2.12), $K = \rho(\sigma^2 - \sigma'^2)$, we find


Ra 1 А: | p 
ya iam GY t \ fa) ... (2.18) 
x ! 


Next suppose that $\sigma$ is kept fixed and that only $\lambda$ varies. Then one of the roots of the equation (2.4) is infinite and the other is $x = (\lambda + \lambda')/2$. In this case $R^*$ is the interval $\big(-\infty, (\lambda + \lambda')/2\big)$ when $\lambda' > \lambda$ and $\big((\lambda + \lambda')/2, \infty\big)$ when $\lambda' < \lambda$. Considering separately the two cases when $\lambda' > \lambda$ and $\lambda' < \lambda$, it may be seen that the exact form of $I_m$ (m-odd) is given by
e Ert) (eoe 


Ea (m—2y) X — Aà 
М | 
= SEE = Am 
=0 24/2.mg- Hr (2219) 
Putting, in particular, $m = 1$ in (2.18) and (2.19), one obtains the results of Huzurbazar (1955).
Type III distribution. Consider the type III distribution

$$f(x, a, p) = \exp\{-ax + p\log x - \log x + p\log a - \log\Gamma(p)\} \qquad (x \ge 0). \qquad \ldots\ (2.20)$$

Here $u_1 = -a$ and $u_2 = p$ and $B(a, p) = p\log a - \log\Gamma(p)$.
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In this case R* is given by the set of points x for which 


PP ore). zm. aU qui)‏ ی 
a^ Г(р)‏ 


For brevity we discuss below only the following two interesting cases 
(а a, p'—p > —1) and (a! <a, p —p € —1)- Let now yy and (и, < дз) be the roots 


of the equation 


Qmm, prono ТОР) — 0. ELS. 
а? Г(р) 
Case (1): Leta’ >a and p'—p > —1. Then R* is composed of the two intervals 
(0, pt) and (us, 9%). ` 


Substituting in the relation (2.6), we find after some simplification, 


т-у 7 r (erm) 
m ‚ш m m м 
т E Me : = 294, —On/)—1 
In= cr) er em еше МЕСЕ 
380 ren” Fe eT 


(2.23) 


mya Ya т—у p'A-YP ) is the incomplete gamma integral defined by 
where ga = | то: m 


which is extensively tabulated by Pearson.

Case (2): Let $a' < a$ and $p' - p < -1$. Then $R^*$ is the interval $(\mu_1, \mu_2)$ and in this case $I_m$ (m-odd) would just be the negative of (2.23) so that in the general case the absolute value of (2.23) gives the exact form of $I_m$ (m-odd) when the parameters $a$ and $p$ vary simultaneously. As in the previous example we may obtain the exact form of $I_m$ (m-odd) when the parameters vary separately, by putting $a = a'$ and $p = p'$ in (2.23) successively, and it will be seen that in this case the expressions will be greatly simplified.

Poisson distribution. Take the Poisson distribution, defined by

$$f(x, a) = \exp\{-a + x\log a - \log x!\}, \qquad x = 0, 1, 2, \ldots \qquad \ldots\ (2.24)$$
Here $u = \log a$, $B(a) = -a = -e^{u} = b(u)$ and $R^*$ is given by the set of all points $x$ for which $x \le \rho$ if $a' > a$ and $x > \rho$ if $a' < a$, where $\rho$ is the greatest integer contained in $(a' - a)/\log(a'/a)$. It will be seen that the exact form of $I_m$ (m-odd) is the absolute value of

$$\sum_{\gamma=0}^{m} (-1)^{\gamma}\binom{m}{\gamma} \exp\Big\{-\frac{(m-\gamma)a' + \gamma a}{m} + a'^{(m-\gamma)/m}\,a^{\gamma/m}\Big\}\,(1 - 2P_\gamma), \qquad \ldots\ (2.25)$$

where $P_\gamma = \mathrm{Prob}(x \le \rho)$, $x$ being a Poisson variate with parameter $a'^{(m-\gamma)/m}\,a^{\gamma/m}$, which is extensively tabulated in the Biometrika tables.
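The series (2.25) can be checked numerically against the defining sum of $|f'^{1/m} - f^{1/m}|^m$. The following is a minimal sketch in Python, not part of the original paper: the function names, the parameter values used in the usage note, and the truncation point `x_max` of the direct sum are all assumptions made for illustration.

```python
import math


def poisson_pmf(x, a):
    """Poisson probability exp(-a + x log a - log x!), as in (2.24)."""
    return math.exp(-a + x * math.log(a) - math.lgamma(x + 1))


def poisson_cdf(k, a):
    """Prob(x <= k) for a Poisson variate with parameter a."""
    return sum(poisson_pmf(x, a) for x in range(k + 1))


def i_m_direct(m, a, a_prime, x_max=200):
    """Direct sum of |f'^(1/m) - f^(1/m)|^m, truncated at x_max (assumed large enough)."""
    return sum(
        abs(poisson_pmf(x, a_prime) ** (1.0 / m) - poisson_pmf(x, a) ** (1.0 / m)) ** m
        for x in range(x_max + 1)
    )


def i_m_series(m, a, a_prime):
    """Absolute value of the alternating series (2.25) for odd m."""
    # rho: greatest integer contained in (a' - a) / log(a'/a)
    rho = math.floor((a_prime - a) / math.log(a_prime / a))
    total = 0.0
    for g in range(m + 1):
        # Parameter of the auxiliary Poisson variate for this term
        theta = a_prime ** ((m - g) / m) * a ** (g / m)
        weight = math.exp(-((m - g) * a_prime + g * a) / m + theta)
        total += (-1) ** g * math.comb(m, g) * weight * (1.0 - 2.0 * poisson_cdf(rho, theta))
    return abs(total)
```

For instance, `i_m_direct(3, 2.0, 3.5)` and `i_m_series(3, 2.0, 3.5)` should agree to within rounding error, and swapping the two parameters should leave both values unchanged.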
It will be seen that


and that its differential form is 


$I_1 = F\,da$,
so that $I_1$ does not lead to the usual prior probability form $da/a$ for $a$, as given by Jeffreys.


Binomial distribution. Consider the Binomial distribution:

$$f(x, p) = \binom{n}{x} p^{x}(1-p)^{n-x} = \exp\Big\{x\log\frac{p}{1-p} + \log\binom{n}{x} + \log(1-p)^{n}\Big\}, \qquad x = 0, 1, \ldots, n. \qquad \ldots\ (2.26)$$

Here $u = \log\dfrac{p}{1-p}$ and $B(p) = \log(1-p)^{n} = -\log(1 + e^{u})^{n} = b(u)$,

and in this case $R^*$ will be given by the set of all points $x$ for which $x \le \delta$ if $p' > p$ and $x > \delta$ if $p' < p$, where $\delta$ is the greatest integer contained in

$$n\,\log\frac{1-p}{1-p'} \Big/ \log\frac{p'(1-p)}{p(1-p')}.$$

It is easy to see that the exact form of $I_m$ (m-odd) would be given by the absolute value of

$$\sum_{\gamma=0}^{m} (-1)^{\gamma}\binom{m}{\gamma}\Big[p'^{(m-\gamma)/m}p^{\gamma/m} + (1-p')^{(m-\gamma)/m}(1-p)^{\gamma/m}\Big]^{n}(1 - 2P_\gamma), \qquad \ldots\ (2.27)$$

where $P_\gamma = \mathrm{Prob}(x \le \delta)$, $x$ being a Binomial variate with the parameter

$$\frac{p'^{(m-\gamma)/m}\,p^{\gamma/m}}{p'^{(m-\gamma)/m}\,p^{\gamma/m} + (1-p')^{(m-\gamma)/m}(1-p)^{\gamma/m}}.$$


$P_\gamma$ may be obtained readily from tables of the cumulative binomial distribution (National Bureau of Standards), or from a table of the incomplete beta function, since

$$\sum_{j=0}^{K}\binom{n}{j} x^{j}(1-x)^{n-j} = B_{1-x}(n-K, K+1).$$


Putting in particular, $m = 1$ in (2.27) it is seen that

$$I_1 = 2\,\big|B_{1-p}(n-\delta, \delta+1) - B_{1-p'}(n-\delta, \delta+1)\big|$$

and that the differential form for $I_1$ is

$$I_1 = G(p)\,dp$$

so that in this case also $I_1$ does not lead to the appropriate uniform prior probability form for the parameter $p$, as given by Jeffreys.

Finally we shall obtain the differential forms for the parameters $\lambda$ and $\sigma$ of the normal distribution $N(\lambda, \sigma)$ and show that $I_m$, whether $m$ is odd or even, leads to the usual prior probability forms for the parameters $\lambda$ and $\sigma$, when they vary separately. But the difficulty arises, as noticed by Jeffreys, when they vary simultaneously.

First, we shall obtain the differential form of $I_m$ for small variations of $\lambda$. Let $m$ be odd. Then to the first order, we have


о m e 
ofi" (ал)" = NET T 
In = (GAY fe» dx = 2 їз. ВЕР t = const. (dA)”, 
í [a бю)" ym 
so that $I_m^{1/m} \propto d\lambda$, $\qquad \ldots\ (2.28)$
which leads to the appropriate prior probability form for $\lambda$ when $\sigma$ is known, as given by Jeffreys.

Similarly when $\lambda$ is known,

$$I_m = \text{const.}\,\Big(\frac{d\sigma}{\sigma}\Big)^{m}, \quad \text{so that} \quad I_m^{1/m} \propto \frac{d\sigma}{\sigma}, \qquad \ldots\ (2.29)$$

which again leads to the usual prior probability form for $\sigma$ when $\lambda$ is known, as given by Jeffreys. These results hold good even when $m$ is even.
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In conclusion, I wish to acknowledge my indebtedness to Dr. V. S. Huzurbazar for 
his valuable guidance and useful suggestions, which have greatly improved both the form 
and content of this paper. My grateful thanks are due to the Government of India for 
awarding me a Senior Research Training Scholarship. 
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NUMERICAL EVALUATION OF CERTAIN MULTIVARIATE 
NORMAL INTEGRALS 


By PETER IHM 
Botanic Institute, Freiburg, Germany 


SUMMARY. A numerical solution of multivariate normal integrals over a region $B$ is given where the covariance matrix is the sum of a diagonal matrix $D$ and the product of a row vector with its transpose. An $n$-fold integral over $B$ of a multivariate normal distribution with covariance matrix $D$ is calculated in dependence of a parameter $\tau$, the resulting function multiplied with a function of $\tau$ and integrated with respect to $\tau$. The method covers the integral of all multivariate normal distributions with density constant over the surface of a hyperellipsoid rotationally symmetric about the longest axis.

The evaluation of the multivariate normal integral has been the object of repeated studies (see David (1953), Plackett (1954), McFadden (1956)). A reduction formula given by Plackett is, theoretically, applicable to all $n$-dimensional normal distributions, but is quite laborious for higher $n$. We cannot, therefore, apply this formula conveniently for $n$ larger than five. Another general procedure may be constructed by use of an $n$-dimensional extension of Simpson's rule by Von Mises (1954); but in this case also the effort in calculations is too great to make it useful for higher dimensions. For the moment the most satisfying general method seems to be the Monte Carlo method by use of an electronic computer since the number of points to be tested is independent of the dimensionality $n$. The author obtained good results by aid of the IBM Magnetic Drum Calculator Type 650 by generation of $n$-dimensional vectors with normally distributed components, but even for such a machine the amount of time necessary to reach a desired accuracy may be too great for higher $n$. It seems therefore justifiable to seek for simpler methods which apply at least for more special types of frequently occurring normal distributions. In this paper a simple method will be given for the integration of an $n$-dimensional normal distribution, the covariance matrix of which is of the form

$$A = D + c^{-2}\,i\,i'$$

where $D$ is a positive definite diagonal matrix, $i$ a unit vector and $c^2 > 0$ a scalar.

Let us consider the $(n+1)$-fold integral

$$I = \frac{c}{(2\pi)^{(n+1)/2}\,|D|^{1/2}} \int_{-\infty}^{\infty}\int_{B} \exp\Big\{-\tfrac12 (z - \tau i)'D^{-1}(z - \tau i) - \tfrac12 c^2\tau^2\Big\}\,dz_1 \ldots dz_n\,d\tau \qquad \ldots\ (1)$$

where the integral sign marked $B$ stands for the $n$-fold integral over the region $B$. We obtain


© sp” (z—7i 272 
Doi? аран de, ... de, dr 


[ 0-0 


І = 11 i 
(2л) > p] =e 
c f | E sciatis dz, ... de, dr 
PNG 
en* [DI -e5 


os! ту-1>\? 
e o. коз Dzj de, ... de, dr 


a cibus 


mem 
= +1 i 
[1 DI 2% B 


DD)‏ ي 
and

$$a^2 = i'D^{-1}i + c^2. \qquad \ldots\ (2)$$

We get, by some calculations,

$$A^{-1} = D^{-1} - a^{-2}D^{-1}i\,i'D^{-1} \qquad \ldots\ (3)$$

which may be proved by direct calculation. We have

$$AA^{-1} = (D + c^{-2}i\,i')(D^{-1} - a^{-2}D^{-1}i\,i'D^{-1}) = I - a^{-2}i\,i'D^{-1} + c^{-2}i\,i'D^{-1} - c^{-2}a^{-2}i\,i'D^{-1}i\,i'D^{-1} \qquad \ldots\ (4)$$

where $I$ is the unit matrix. The last three terms of (4) should give the null matrix $O$, i.e. we should have

$$-c^2\,i\,i'D^{-1} + a^2\,i\,i'D^{-1} - i\,i'D^{-1}i\,i'D^{-1} = O.$$

Because of (2) we obtain

$$(i'D^{-1}i)\,i\,i'D^{-1} - i\,i'D^{-1}i\,i'D^{-1} = O$$

and after application of the commutative law for scalar matrix multiplication to the first term and of the associative law for matrix multiplication to the second finally

$$i\,(i'D^{-1}i)\,i'D^{-1} - i\,(i'D^{-1}i)\,i'D^{-1} = O$$

which proves (3).
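The inverse (3) can also be confirmed numerically for a particular case. The sketch below is illustrative only and is not taken from the paper: the function name and the values of $D$, $i$ and $c^2$ in the usage note are assumptions. It builds $A = D + c^{-2}ii'$ and the claimed inverse entry by entry and measures how far their product is from the unit matrix.

```python
def max_deviation_from_identity(d, i, c2):
    """Largest absolute entry of A * A_inv - I for A = D + i i' / c2.

    d  : diagonal entries of D (all positive)
    i  : components of the vector i
    c2 : the scalar c^2 (> 0)
    """
    n = len(d)
    # a^2 = i' D^{-1} i + c^2, equation (2)
    a2 = c2 + sum(ik * ik / dk for ik, dk in zip(i, d))
    # A = D + c^{-2} i i'
    A = [[(d[j] if j == k else 0.0) + i[j] * i[k] / c2 for k in range(n)]
         for j in range(n)]
    # Claimed inverse, equation (3): D^{-1} - a^{-2} D^{-1} i i' D^{-1}
    A_inv = [[(1.0 / d[j] if j == k else 0.0) - (i[j] / d[j]) * (i[k] / d[k]) / a2
              for k in range(n)] for j in range(n)]
    worst = 0.0
    for j in range(n):
        for k in range(n):
            entry = sum(A[j][l] * A_inv[l][k] for l in range(n))
            target = 1.0 if j == k else 0.0
            worst = max(worst, abs(entry - target))
    return worst
```

With, say, `d = [2.0, 0.5, 1.5]`, `i = [3 ** -0.5] * 3` and `c2 = 0.5`, the returned deviation should be at the level of rounding error.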


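We may in (1) invert the sequence of integration and first integrate over $\tau$. This yields

$$I = \frac{1}{(2\pi)^{n/2}\,|A|^{1/2}} \int_{B} \exp\{-\tfrac12 z'A^{-1}z\}\,dz_1 \ldots dz_n$$

so that $I$ is equal to an $n$-fold integral of a normal distribution with expectations $E\,z = o$ and $E\,zz' = A$, $o$ being the null vector. Now, if $D$ is supposed to be a diagonal matrix and $B$ an $n$-dimensional interval, it is easy to compute

$$I(\tau) = \frac{1}{(2\pi)^{n/2}\,|D|^{1/2}} \int_{B} \exp\{-\tfrac12 (z - \tau i)'D^{-1}(z - \tau i)\}\,dz_1 \ldots dz_n \qquad \ldots\ (5)$$

in dependence of $\tau$. The integral (1) is equal to

$$I = \frac{c}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-c^2\tau^2/2}\,I(\tau)\,d\tau \qquad \ldots\ (6)$$

which may be obtained by simple product integration using Simpson's rule.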

A very convenient method is the use of a Stieltjes coordinate for the abscissa, where

$$x = \frac{c}{\sqrt{2\pi}} \int_{-\infty}^{\tau} e^{-c^2 t^2/2}\,dt.$$

$I(\tau)$ is then drawn over the transformed abscissa and the integral is obtained by a planimeter. This gives a relative error of 1 to 2 per cent. It is convenient to construct the Stieltjes integral abscissa for all values of $c$ by substituting $\tau^*/c$ for $\tau$. We get from (6)

$$I = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\tau^{*2}/2}\,I\Big(\frac{\tau^*}{c}\Big)\,d\tau^*. \qquad \ldots\ (7)$$


It is not necessary that $B$ be an interval nor that $D$ be a diagonal matrix, but the method is more easily applicable in cases where it is. $B$ and the positive definite matrix $D$ may be of either form. It is only necessary that $c^2 > 0$. $c^2 = 0$ is trivial and $c^2 < 0$ changes the positive definiteness of the quadratic form in the exponent of the normal distribution in (1) so that the integral (1) does no longer exist.


If, for $b^2 > 0$,

$$D = b^2 I,$$

we have the class of multivariate normal distributions with density constant over the surface of a hyperellipsoid which is rotationally symmetric about the longest axis. This may be shown by introducing

$$y = Sz, \text{ say},$$

where $S$ is an orthogonal matrix whose first row is $i'$. The covariance matrix of $y$ is

$$SAS' = \mathrm{diag}(b^2 + c^{-2},\, b^2,\, \ldots,\, b^2).$$

Thus, the length of one axis is proportional to $(b^2 + c^{-2})^{1/2}$, that of the others to $b$.

As an example consider the integral over a distribution with $c^2 = 1/2$ and

$$D = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad i' = \Big(\frac{1}{\sqrt2}, \frac{1}{\sqrt2}\Big), \qquad A = D + c^{-2}\,i\,i' = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix},$$

where $B$ is defined by $z_1 \ge 0$, $z_2 \ge 0$. By theory the integral must be

$$I = \frac14 + \frac{1}{2\pi}\arcsin\rho = \frac13$$

(see Cramér 1946, p. 290); $\rho$ is the correlation coefficient. Since $D = D^{-1} = I$ we get from (5)

$$I(\tau) = \Big\{\frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} e^{-\frac12 (z - \tau/\sqrt2)^2}\,dz\Big\}^{2}$$

or for use of (7)

$$I(\tau^*\sqrt2) = \Phi^2(\tau^*),$$

where $\Phi(x)$ denotes the normal distribution function.



The figure shows the values of $\Phi^2(\tau^*)$ in dependence of $\tau^*$ drawn on Stieltjes integral coordinates. Planimeter reading gives $I = 0.334$ which is in good agreement with expectation.
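The graphical integration of this example can be imitated numerically. The sketch below is a minimal illustration, not the author's procedure: it assumes the reading of the example in which $D$ is the unit matrix, $i' = (1/\sqrt2, 1/\sqrt2)$, $c^2 = 1/2$ and $B$ is the positive quadrant, and it replaces the planimeter by Simpson's rule on a truncated range. The function names and the integration limits are hypothetical choices.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def simpson(f, lo, hi, n):
    """Composite Simpson's rule with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3.0


def inner_integral(tau):
    """I(tau) of (5): both components have mean tau / sqrt(2) and unit variance."""
    return normal_cdf(tau / math.sqrt(2.0)) ** 2


def quadrant_probability(c2=0.5, limit=10.0, n=2000):
    """Formula (7): weight I(tau*/c) by the standard normal density and integrate."""
    c = math.sqrt(c2)

    def integrand(t):
        return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi) * inner_integral(t / c)

    return simpson(integrand, -limit, limit, n)
```

Under these assumptions `quadrant_probability()` should reproduce the theoretical value 1/3 quoted from Cramér to many decimal places.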
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ON THE EVALUATION OF THE PROBABILITY INTEGRAL OF A 
MULTIVARIATE NORMAL DISTRIBUTION 


By S. JOHN
Indian Statistical Institute, Calcutta 


SUMMARY. A simple reduction formula is obtained for the probability integral of a multivariate normal distribution. The derivation involves only elementary results in probability theory. The formula obtained can be used in evaluating probability integrals of multivariate normal distributions of order $k$ when those of order $k-1$ are readily available. An example is worked out illustrating the use of the formula.

INTRODUCTION 


In so far as statisticians often assume many populations to be multivariate normal, the evaluation of the probability integral of the multivariate normal distribution is of especial importance. While this has been done for univariate and bivariate normal distributions, the extension to populations of higher orders presents considerable difficulties. For previous work on this problem reference may be made to David (1953), Plackett (1954), Moran (1956) and Das (1956).
NOTATION AND PRELIMINARIES 


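The vector valued random variable $X = (X_1, X_2, \ldots, X_p)$ will be said to have the density function $n_p(x - \mu; \Sigma)$ if it has the multivariate normal distribution with means $\mu = (\mu_1, \mu_2, \ldots, \mu_p)$ and dispersion matrix $\Sigma = (\sigma_{ij})$:

$$n_p(x - \mu; \Sigma) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2} \exp\{-\tfrac12 (x - \mu)\Sigma^{-1}(x - \mu)'\}.$$

$\Sigma_i$ will denote the matrix $(\sigma_{jk} - \sigma_{ji}\sigma_{ik}/\sigma_{ii})$ with the $i$-th row and column deleted. For scalar $u$,

$$\mu^{(i)}(u) = (\mu_1, \mu_2, \ldots, \mu_{i-1}, \mu_{i+1}, \ldots, \mu_p) + (u - \mu_i)(\beta_{1i}, \beta_{2i}, \ldots, \beta_{i-1,i}, \beta_{i+1,i}, \ldots, \beta_{pi})$$

where $\beta_{ji} = \sigma_{ji}/\sigma_{ii}$.

Given $X_i = u$, $X^{(i)} = (X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_p)$ is distributed according to the density function $n_{p-1}(x^{(i)} - \mu^{(i)}(u); \Sigma_i)$.

THE NEW METHOD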


The problem is to evaluate integrals of the form 


rer b, be bp 
[ f- (и; Ed and D J y ngl; Ede (1) 
a, 42 4 
о ай = = 
| пи; BAe = | | (=a; Xy e (2) 
и ар 
by b; а 
f nq; Eis = | | n(b—p; 7)%® (3) 
and, | ps 
hs ет and p = (0-5 bp) (4) 
Therefore we need consider only integrals of the type

$$I = \int_{0}^{\infty} \cdots \int_{0}^{\infty} n_p(x - \mu; \Sigma)\,dx. \qquad \ldots\ (5)$$

We now observe that $I$ is the probability that $U = \min(X_1, \ldots, X_p)$ is greater than or equal to zero. If $f(u)$ is the probability density function of $U$,

$$I = \int_{0}^{\infty} f(u)\,du. \qquad \ldots\ (6)$$


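To derive the probability density function of $U$, we employ the following simple argument. The event that $U$ lies in the interval $(u - du, u)$ can happen in $p$ mutually exclusive ways. Either $X_1$ lies in the interval $(u - du, u)$ and $X_2, X_3, X_4, \ldots, X_p$ are all greater than $u$, or $X_2$ lies in the interval $(u - du, u)$ and $X_1, X_3, X_4, \ldots, X_p$ are all greater than $u$, or $X_3$ lies in the interval $(u - du, u)$ and $X_1, X_2, X_4, X_5, \ldots, X_p$ are all greater than $u$, and so on. Thus,

$$f(u) = \sum_{i=1}^{p} \Big\{\int_{u}^{\infty} \cdots \int_{u}^{\infty} n_{p-1}\big(x^{(i)} - \mu^{(i)}(u); \Sigma_i\big)\,dx^{(i)}\Big\}\,(2\pi\sigma_{ii})^{-1/2} \exp\Big\{-\frac{(u - \mu_i)^2}{2\sigma_{ii}}\Big\} \qquad \ldots\ (7)$$

where $dx^{(i)} = dx_1 \ldots dx_{i-1}\,dx_{i+1} \ldots dx_p$, and $\mu^{(i)}(u)$ and $\Sigma_i$ are as defined in the preliminaries.

If all the variances $\sigma_{ii}$ are equal and all the covariances $\sigma_{ij}$ are equal and also $\mu_1 = \mu_2 = \ldots = \mu_p$, (7) will take the simpler form

$$f(u) = p\,(2\pi\sigma_{11})^{-1/2} \exp\Big\{-\frac{(u - \mu_1)^2}{2\sigma_{11}}\Big\} \int_{u}^{\infty} \cdots \int_{u}^{\infty} n_{p-1}\big(x^{(1)} - \mu^{(1)}(u); \Sigma_1\big)\,dx^{(1)}.$$

There are also obvious simplifications when some of the simple or partial correlations are zero.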

The evaluation of the density function (7) requires evaluations of the probability integral of multivariate normal distributions of order $p - 1$. Formula (6) may thus be regarded as a sort of reduction formula. When probability integrals of normal distributions of order $k$ are readily available, formula (6) may be used in conjunction with methods of numerical integration to evaluate probability integrals of order $k+1$ and, within reasonable limits of labour, of order $k+2$. Thus with the tables now available, probabilities for distributions of order three and four may be evaluated without much difficulty. We also feel that (6) may be profitably used to extend existing tables of the probability integral of the bivariate normal distribution (Pearson, 1931). This will require only the ordinates of the univariate normal curve at selected points and the area under this curve below these points. Extensive tables for these are available in several places (Pearson and Hartley; U.S. Dept. of Commerce, 1953).
The example given below will help to clarify some points.


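Example: We shall evaluate

$$I = \int_{0}^{\infty}\int_{0}^{\infty}\int_{0}^{\infty} n_3(x - \mu; \Sigma)\,dx_1\,dx_2\,dx_3 \qquad \ldots\ (8)$$

for the case $\mu = 0$, $\sigma_{11} = \sigma_{22} = \sigma_{33} = 1$, $\sigma_{12} = \sigma_{21} = \sigma_{13} = \sigma_{31} = \sigma_{23} = \sigma_{32} = 0.6$. (We wish to emphasize here that our method is applicable whatever $\mu$ and $\Sigma$.) Formula (6) in this case becomes

$$I = 3\int_{0}^{\infty} f_1(u)\,du \qquad \ldots\ (9)$$

where

$$f_1(u) = n(u) \int_{u}^{\infty}\int_{u}^{\infty} n_2\big(x^{(1)} - \mu^{(1)}(u); \Sigma_1\big)\,dx_2\,dx_3 \qquad \ldots\ (10)$$

and

$$n(u) = (2\pi)^{-1/2}\,e^{-u^2/2}. \qquad \ldots\ (11)$$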


$f_1(u)$ was calculated for $u = 0, .2, .4, \ldots, 4.2$ from values given in Karl Pearson's Tables for Statisticians and Biometricians, Part II, and from Biometrika Tables. We used Weddle's rule to calculate the integral of $f_1(u)$ from 0 to 3.6 and Simpson's three-eighths rule to evaluate the integral from 3.6 to 4.2. Of course, any other convenient method of numerical integration could have been adopted.

The contribution to (9) by the integral of $f_1(u)$ from 4.2 to $\infty$ remains to be assessed. From tables we find that $f_1(u) \le .0019\,n(u)$ for $u > 4.2$. Therefore,

$$\int_{4.2}^{\infty} f_1(u)\,du \le .0019 \int_{4.2}^{\infty} (2\pi)^{-1/2} e^{-u^2/2}\,du = .0019 \times .0000133 = (.25)10^{-7}. \qquad \ldots\ (12)$$

This we regard negligible. Thus the value of the integral from 0 to 4.2 may be taken as an approximation to the value of the integral from zero to infinity. In general, we can always replace the upper limit of integration in (6) by a suitable finite number $h$ which will give results of sufficient accuracy. For the problem of our example, the result obtained by this method was

$$I = .2907. \qquad \ldots\ (13)$$
We now compare this result with the one given by the formula

$$\int_{0}^{\infty}\int_{0}^{\infty}\int_{0}^{\infty} \exp\{-\tfrac12 x M^{-1} x'\}\,dx = \frac{(2\pi)^{3/2}\,|M|^{1/2}}{4\pi}\big\{\cos^{-1}(-\rho_{12}) + \cos^{-1}(-\rho_{13}) + \cos^{-1}(-\rho_{23}) - \pi\big\} \qquad \ldots\ (14)$$

where $M$ is the matrix $(\rho_{ij})$ $(\rho_{11} = \rho_{22} = \rho_{33} = 1)$ (Plackett, 1954). Formula (14) gives the value .27554 which differs from result (13) by about .0035. This much of difference was expected since, in evaluating $f_1(u)$ for various values of $u$, only linear interpolation was used with regard to the correlation coefficient.
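The reduction (6) for this equicorrelated trivariate case can be carried through numerically without tables. The following Python sketch is an illustration under stated assumptions, not the computation reported in the paper: it uses Simpson's rule instead of Weddle's rule, truncates the outer integral at an assumed upper limit, and obtains the bivariate probability in (10) by a second one-dimensional integration. All function names are hypothetical.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def upper_tail(x):
    """Prob(Z > x) for a standard normal variate Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def simpson(f, lo, hi, n):
    """Composite Simpson's rule with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3.0


def bivariate_upper(h, r, n=400):
    """Prob(Z2 > h, Z3 > h) for standard normal Z2, Z3 with correlation r."""
    s = math.sqrt(1.0 - r * r)
    return simpson(lambda z: normal_pdf(z) * upper_tail((h - r * z) / s), h, h + 10.0, n)


def orthant_by_reduction(rho, upper=8.0, n=400):
    """I = 3 * integral of n(u) * Prob(X2 > u, X3 > u | X1 = u), formulas (9)-(10)."""
    cond_sd = math.sqrt(1.0 - rho * rho)        # conditional s.d. of X2 given X1 = u
    r = (rho - rho * rho) / (1.0 - rho * rho)   # partial correlation of X2, X3 given X1

    def f1(u):
        h = (u - rho * u) / cond_sd             # standardized lower limit
        return normal_pdf(u) * bivariate_upper(h, r)

    return 3.0 * simpson(f1, 0.0, upper, n)


def orthant_closed_form(rho):
    """Right-hand side of (14) after normalization, with all correlations equal to rho."""
    return (3.0 * math.acos(-rho) - math.pi) / (4.0 * math.pi)
```

Calling `orthant_by_reduction(0.6)` and `orthant_closed_form(0.6)` allows the reduction formula and formula (14) to be compared directly for the correlations of the example.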
THE DISTRIBUTION OF WALD'S CLASSIFICATION STATISTIC
WHEN THE DISPERSION MATRIX IS KNOWN
By S. JOHN


Indian Statistical Institute, Calcutta 


SUMMARY. The discriminant function can be used for classifying an individual as belonging to one or other of two populations provided we know the parameters characterising the two populations. Wald suggested the use of a certain statistic in situations where such knowledge is absent. The exact distribution of this statistic in the case where the dispersion matrix is known, is obtained in this paper.

1. INTRODUCTION

Consider the problem of classifying a given collection of individuals as belonging to one or other of two populations $P_1$, $P_2$ based on measurements carried out on each individual with respect to $p$ characteristics $x_1, x_2, \ldots, x_p$. Let us assume that among the individuals belonging to $P_1$, $x_1, x_2, \ldots, x_p$ follow a multivariate normal distribution with means $\mu_1, \mu_2, \ldots, \mu_p$ and variance-covariance matrix $\Sigma$ and that among individuals belonging to $P_2$, $x_1, x_2, \ldots, x_p$ follow a multivariate normal distribution with the same variance-covariance matrix $\Sigma$ but with means $\nu_1, \nu_2, \ldots, \nu_p$. According to a method originally suggested by Fisher (1936) an individual with measurements $y_1, y_2, \ldots, y_p$ is assigned to $P_1$ or $P_2$ according as

$$(\mu - \nu)\Sigma^{-1}y' \gtrless \tfrac12 (\mu - \nu)\Sigma^{-1}(\mu + \nu)'$$

where $\mu = (\mu_1, \mu_2, \ldots, \mu_p)$, $\nu = (\nu_1, \nu_2, \ldots, \nu_p)$ and $y = (y_1, y_2, \ldots, y_p)$. It will be noted that Fisher's method requires a knowledge of $\mu$, $\nu$ and $\Sigma$. To remedy this Wald (1944) considered the use of the statistic

$$U = (\bar{x}^{(1)} - \bar{x}^{(2)})\,S^{-1}y',$$

the individuals being classified as belonging to $P_2$ when $U < d$, where $d$ is so determined that the critical region is of the desired size; $\bar{x}^{(1)}$ is the vector of means determined from measurements on a sample of $n_1$ individuals known to belong to $P_1$; $\bar{x}^{(2)}$ is the vector of means determined from measurements on a random sample of $n_2$ individuals known to belong to $P_2$, and $S$ is the variance-covariance matrix estimated from the pooled corrected sums of squares and products from the two samples. The distribution of $U$ turns out to be complicated and Wald has not given an explicit expression for it.

This paper considers the distribution of $U$ in the case when the variance-covariance matrix is known, i.e. the distribution of

$$V = (\bar{x}^{(1)} - \bar{x}^{(2)})\,\Sigma^{-1}y' \qquad \ldots\ (1)$$

where $\bar{x}^{(1)}$, $\bar{x}^{(2)}$, $\Sigma$ and $y$ have the same meanings as before.
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2. REDUCTION OF THE PROBLEM 


Looking at the problem from a more general point of view, it will be seen that the statistic whose distribution is in question is

$$z = t\,\Sigma^{-1}w' \qquad \ldots\ (2.1)$$

where $t = (t_1, t_2, \ldots, t_p)$ is a vector of $p$ normal variates following a multivariate normal distribution with variance-covariance matrix $\Sigma$ and means $a = (a_1, a_2, \ldots, a_p)$, and where $w = (w_1, w_2, \ldots, w_p)$ is a vector of $p$ normal variates independent of $t$ and following a multivariate normal distribution with variance-covariance matrix $\Sigma$ and mean $b = (b_1, b_2, \ldots, b_p)$. The statistic $V$ reduces to $z$ when multiplied by $(n_1 n_2/(n_1 + n_2))^{1/2}$.


In the further reduction of the problem we require the following. 


Lemma: The statistic z is invariant under non-singular transformations of t and w 
provided the two transformations are the same. 


Proof: Let $x = tC$ and $y = wC$ where $C$ is a $(p \times p)$ non-singular matrix. Let $\Sigma_0$ denote the variance-covariance matrix of $x$ (which is the same as the variance-covariance matrix of $y$). Then

$$\Sigma_0 = C'\Sigma C \quad \text{and} \quad x\,\Sigma_0^{-1}y' = tC\,(C^{-1}\Sigma^{-1}C'^{-1})\,C'w' = t\,\Sigma^{-1}w'.$$

This proves the lemma.


Since, when $\Sigma$ is positive definite, there always exists a non-singular matrix $C$ such that

$$C'\Sigma C = I$$

this lemma reduces our problem to a consideration of the statistic

$$T = u_1v_1 + u_2v_2 + \ldots + u_pv_p \qquad \ldots\ (2.2)$$

where $u_1, v_1, u_2, v_2, \ldots, u_p, v_p$ are independent normal variates with unit variance. Without loss of generality we may assume that

$$E(u_1) = E(u_2) = \ldots = E(u_p) = 0.$$

Let

$$E(v_i) = m_i \quad (i = 1, 2, \ldots, p). \qquad \ldots\ (2.3)$$

The expectation and variance of $T$ are easily found.

$$E(T) = \sum_{i=1}^{p} E(u_iv_i) = 0$$

since, by our assumption, $E(u_i) = 0$ $(i = 1, 2, \ldots, p)$.

$$V(T) = \sum_{i=1}^{p} V(u_iv_i) = \sum_{i=1}^{p} E(u_i^2v_i^2) = \sum_{i=1}^{p} (1 + m_i^2) = p + \sum_{i=1}^{p} m_i^2.$$

To find the distribution of $T$ we adopt the method of characteristic functions.
3. THE CHARACTERISTIC FUNCTION OF T

The characteristic function of $u_iv_i$ is

$$\phi_i(\theta) = E(e^{i\theta u_iv_i}) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\{-\tfrac12 v_i^2\theta^2 - \tfrac12 (v_i - m_i)^2\}\,dv_i = (1 + \theta^2)^{-1/2} \exp\Big\{-\frac{m_i^2\theta^2}{2(1 + \theta^2)}\Big\} \qquad \ldots\ (3.1)$$

and hence we get for the characteristic function of $T$

$$\phi(\theta) = E(e^{i\theta T}) = (1 + \theta^2)^{-p/2} \exp\Big\{-\frac{m\theta^2}{1 + \theta^2}\Big\}$$

where $m = \tfrac12 \sum_{i=1}^{p} m_i^2$.
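The closed form (3.1) can be checked against a direct numerical integration over $v_i$. The sketch below is only an illustration and is not part of the paper: the function names, the truncation of the range to twelve standard deviations either side of the mean, and the number of Simpson subintervals are assumptions.

```python
import math


def simpson(f, lo, hi, n):
    """Composite Simpson's rule with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3.0


def cf_numeric(theta, m_i, half_width=12.0, n=2400):
    """E exp(i*theta*u*v) with u ~ N(0, 1), v ~ N(m_i, 1), independent.

    Integrating out u first leaves the real integrand exp(-theta^2 v^2 / 2)
    weighted by the normal density of v.
    """
    def integrand(v):
        density = math.exp(-0.5 * (v - m_i) ** 2) / math.sqrt(2.0 * math.pi)
        return math.exp(-0.5 * theta * theta * v * v) * density

    return simpson(integrand, m_i - half_width, m_i + half_width, n)


def cf_closed_form(theta, m_i):
    """Right-hand side of (3.1)."""
    t2 = theta * theta
    return (1.0 + t2) ** -0.5 * math.exp(-m_i * m_i * t2 / (2.0 * (1.0 + t2)))
```

For example, `cf_numeric(0.7, 1.3)` and `cf_closed_form(0.7, 1.3)` should agree closely; raising the closed form to the product over $i$ gives the characteristic function of $T$.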
If we denote by $p(T)$ the density function of $T$, we then have, by inversion,

$$p(T) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-i\theta T}\,\phi(\theta)\,d\theta. \qquad \ldots\ (3.2)$$

For the purpose of inversion we distinguish two cases viz. (i) $p$ even and (ii) $p$ odd.

Inversion: Case (i), $p = 2n$.

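In this case,

$$\phi(\theta) = (1 + \theta^2)^{-n} \exp\Big\{-\frac{m\theta^2}{1 + \theta^2}\Big\} = e^{-m}\Big[\frac{1}{(1 + \theta^2)^{n}} + \frac{m}{1!}\,\frac{1}{(1 + \theta^2)^{n+1}} + \frac{m^2}{2!}\,\frac{1}{(1 + \theta^2)^{n+2}} + \ldots\Big]. \qquad \ldots\ (3.3)$$

Therefore,

$$\frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-i\theta T}\,\phi(\theta)\,d\theta = \frac{e^{-m}}{2\pi} \int_{-\infty}^{+\infty} e^{-i\theta T}\Big[\frac{1}{(1 + \theta^2)^{n}} + \frac{m}{1!}\,\frac{1}{(1 + \theta^2)^{n+1}} + \ldots\Big]\,d\theta. \qquad \ldots\ (3.4)$$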
e 
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The series occurring as integrand in (3.4) being uniformly convergent, permits term by term 
integration and we may write, 


© 
p(T) = e > d NIU] .. (3.5) 
T-0 
where 
1 eer 
gs 
eT) = = J DIY 


\ 


= ge lemen А AAS, (еч) 1) | 7 L7 4 


(2r—2) ! 
Tet c ... (3.6) 


(The last step is obtained by contour integration). 


Case (ii), $p = 2n + 1$.

The characteristic function of $Z = xy$ where $x$ and $y$ are independent normal variates with unit variance and means zero and $b$ respectively would be

$$\phi_Z(\theta) = (1 + \theta^2)^{-1/2} \exp\Big\{-\frac{a\theta^2}{1 + \theta^2}\Big\} \qquad \ldots\ (3.7)$$

where $a = \tfrac12 b^2$. Also, the density function of $Z$ is
=: E Hk 10+ Kk ant OD st... | (3.8) 
ri C 
2 € 2 "m í А 
where K, = К, (2) = 3 =) | e "pu [see Craig (1936)]. ... (3.9) 


Using the inversion theorem for characteristic functions we get from (3.7) and (3.8) 


E —ies „Шу, К. К. 
| € (14-02 Ө = Е Ea Tr! 22| a+ a (амер... |. 
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Therefore, 


+o 2 
EN S a[(14-82) K is x. 
| e S 4-3 [ Kotz sela ê aa., e (8.10) 


= ү(а, 2) (say). 


Differentiating both sides of (3.10) n times with respect to a, we get 


+o Р a[(14- 02) д" 
- $02 € ЭЕ faf, 2). 38x H 
| TIER agr (2) (8.11) 


[We note that differentiation within the integral sign in (3.11) is permissible since 


+o = ee g «I 0.3-0?) 
| прат 


is uniformly convergent]. Since the characteristic function of 7 is 


(1402) "+P exp { = ree | = е" (1--0*)-®*№ exp (m(14-602)-1) 


it follows from (3.11) that the density function of T is 
щт) = a [2 ya T - es (842) 
27 = 


N Tho derivatives of yla, T) required in (3.12) ean be obtained from the relation 
ole : e deriv: у 


211 QTY. Karel: «ces 
ат) =2| И met т Heat | (3.18) 


3) can be regarded as a power series in ‘a’. The 


ries in (3.1 
For any given value of T, the series Me | | values of ‘a’ from the ratio py of the (v-+1)-th 


series is easily seen to be convergent for al 
term to the v-th term 
_ 7l К, a= у (вау). 
Pv = Уу 1) Ку-1 y 


] <8 for all sufficiently large values of 


that |с 
roves v 3.13) is convergent for all values 


; e, S 
referred to MIU. the power series in ( 
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of ‘a’. Therefore the derivatives of (а, Т) w.r.t. а can be obtained by term by term differ- 


entiation of 


2! ary Р 
2 [Kot 2171 Ka Ka ] 


Tables of $K_\nu(T)$ can be found in Watson's book 'Theory of Bessel Functions'.


2 s : CET s А 
9. Some approximations to the distribution of $T$ have been considered. Details will be given elsewhere.
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AN EXTENSION OF HALD’S TABLE FOR THE ONE-SIDED 
CENSORED NORMAL DISTRIBUTION 


By NIKHILESH BHATTACHARYA 
Indian Statistical Institute, Calcutta 


SUMMARY. Hald (1949) outlined a very convenient method of maximum likelihood estimation of the parameters of a one-sided censored normal distribution and gave tables for facilitating the process. Table 1 below is an extension of Hald's main table (Table III). Hald's table gave the values of a certain function $z = f(h, y)$ for values of $h = 0.05, 0.10, \ldots, 0.80$, and for some appropriate values of $y$, $h$ being the fraction of censored observations in the sample. The present extension gives the values of $z$ for some values of $h$ below 0.05, for use in situations where the censored observations cannot be ignored for purposes of estimation, even though they form less than 5% of the total sample.


Hald's method of estimation is briefly as follows:

Suppose there are $n$ observations from a normal distribution with mean $\xi$ and variance $\sigma^2$, and it is known that a number, say $a$, of these observations are less than or equal to a known point of truncation. The values of these $a$ observations are not further specified, unlike the values above the truncation point, which may be denoted by $x_1, x_2, \ldots, x_{n-a}$.

The point of truncation is taken as the origin. Let now


ote 
\ 
| 


> 


© 
c 


-F , wu) = | 9000 Wu) = log, au) 


1 
gu) = Jin e 3 


and y'(u) the first derivative of y(u). 


4 denote the observed degree of truncation in the sample. 


Let h= n 
Hald defines 
(2) lel aoe 
h, 2 > 


708,9 = [90h a]: 


and 
Let now the inverse function to $y = F(h, z)$ with respect to $z$ be denoted by $z = f(h, y)$. This function was tabulated in Table III of Hald (1949) for $h = 0.05, 0.10, \ldots, 0.80$ and for some appropriate values of $y$.

The estimate $\hat z$ of $z$ is then obtained by calculating

$$y = \frac{(n-a)\sum_{i=1}^{n-a} t_i^2}{2\left(\sum_{i=1}^{n-a} t_i\right)^2}$$

and reading $\hat z = f(h, y)$ from the Table.

The next step is to calculate

$$\hat\sigma = \frac{\sum_{i=1}^{n-a} t_i}{n-a}\; g(h, \hat z)$$

and finally $\hat\xi = -\hat z\,\hat\sigma$.
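The estimation steps above can be sketched in code. This is a minimal illustration, not Hald's own procedure: it assumes the forms $g(h,z) = [\frac{h}{1-h}\psi'(z) - z]^{-1}$ and $F(h,z) = \frac12 g(g - z)$, and it replaces the table look-up of $z = f(h, y)$ by a bisection that assumes $F$ is increasing in $z$ on the bracket $[-8, 0]$. The function names and the bracket are choices made for this sketch.

```python
import math

def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def Phi(u):
    """Standard normal distribution function (erfc keeps the lower tail accurate)."""
    return 0.5 * math.erfc(-u / math.sqrt(2.0))

def g(h, z):
    """g(h, z) = 1 / (h/(1-h) * psi'(z) - z), with psi'(z) = phi(z)/Phi(z)."""
    p = h / (1.0 - h) * phi(z) / Phi(z)
    return 1.0 / (p - z)

def F(h, z):
    """F(h, z) = g (g - z) / 2."""
    gz = g(h, z)
    return 0.5 * gz * (gz - z)

def f(h, y, lo=-8.0, hi=0.0, iters=200):
    """Invert y = F(h, z) for z by bisection (assumes F increasing on [lo, hi])."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(h, mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def estimate(t, a):
    """t: measured values above the truncation point (taken as origin);
    a: number of censored observations. Returns (z_hat, sigma_hat, xi_hat)."""
    m = len(t)
    h = a / (m + a)
    s1 = sum(t)
    s2 = sum(v * v for v in t)
    y = m * s2 / (2.0 * s1 * s1)
    z_hat = f(h, y)
    sigma_hat = (s1 / m) * g(h, z_hat)
    return z_hat, sigma_hat, -z_hat * sigma_hat
```

Under these assumptions, `f(0.035, 0.550)` and `f(0.035, 0.630)` come out close to the tabulated entries -2.576 and -1.739, which gives some reassurance that the sketch matches the function being tabulated.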


The function $g(h, z)$ can be easily calculated. Table IV of Hald's paper may be used for this purpose, but direct calculation is not difficult.

The need of the present extension was felt in certain cases of fitting one-sided censored normal distributions to grouped data. The values of $h$ were found to be often below 0.05, and sometimes of the order of 0.01. Although the censored part could be ignored without much loss of information, it would be desirable to make use of it, especially because for examining goodness of fit the tails are valuable. The values of $z = f(h, y)$ in Hald's table change sharply with $h$, as $h$ approaches small values, and graphical extrapolation was out of question.

The present extension intends to facilitate interpolation for values of $h$ below 0.05. The column for $h = 0.001$ is particularly in point. This value of $h$ is clearly outside the range of practical interest. However, cases with $h$ near 0.005 do arise, and the column for $h = 0.001$ will enable one to interpolate for such values.

The computations were based on the Table of Normal Probability Functions, published by the National Bureau of Standards. The figures tabulated are correct to the third place of decimals.
The author is grateful to Shri Rabindranath Mukherjee, Shri Ramdulal Chatterjee
and Shri Amal Kumar Sengupta, for the computation of the table.


REFERENCES 


Hald, A. (1949): Maximum likelihood estimation of the parameters of a normal distribution which is truncated at a known point. Skand. Aktuar., 119-134.

Tables of Normal Probability Functions, National Bureau of Standards, Applied Mathematics Series, 23.


Paper received : August, 1958. 


Vol. 21] SANKHYĀ: THE INDIAN JOURNAL OF STATISTICS [Parts 3 & 4


TABLE 1. VALUES OF FUNCTION z = f(h, y) FOR FITTING ONE-SIDED CENSORED NORMAL DISTRIBUTIONS


  y      h = 0.001    0.010    0.020    0.035    0.050
 (1)        (2)        (3)      (4)      (5)      (6)
0.500 mr 
0:505 —4.405  —3.774 
0.510 —4.039  —3.494 
0.515 —4.937  —8.7158 —3.268 
0.520 —3.959 —3.458 3.080 
0.525 —4.491  —4.021 3.248 
0.530 —4.043  —3.723 —3.072 
0.535 —3.747 —3.482 —23.922 
0.540 —3.508 —3.283 —2.792 
0.545 —3.310 —3.114 —2.677 
0.550 —3.142 —2.969  —2.799 —2.576 
0.555 —2.997 —2.843 —2.689 —2485 
0.560 —2.870 —2.731 —2.591  —92.404 
0.565  —2.759 —2.632 —2.503  —9.329 
0.570 —2.659 —2.542 —2.423 2.289 
0.575 —2.569 —2.351 —9 
0.580 —2.488 9:284 Сао 
2.142 
0.585 —2.415 о E 
2.089 
0.500 —2.347 —2.167 —2.040 
0.595 —2.285 —2.115  —1.993 
0.600  —2.227  —2.148  —9.000 —1.950 
0.610  —2.194  —2.053 —1.978  —1.872 
0.620 —2.034  —1.969  —1.900  —1.802 
0.630 —1.954  —1.894  —1.830  —1.739 
0.640 — —1.883 —1.828  —1.708  — 1.683 
0.650  —1.820  —1.768  —1.712  — 1.839 
0.600 1.762 —1:713 1.660 1:622 
0.070 —1.710 —1.663  —1.613 1:541 
0.680 —1.662  —1.618  —1.570 1:501 
0.690 —1.017  —1.576 —1.530 1.461 
0.700 —1.577 —1.537 т 493 
0.710 —1.599 —1.500 1459 1:280 
0.720 —1.503  —1.400 —14% 1:598 
0.730 —1.470 —1.435 1.306 б] ош) 
0.740 . —L.440  —1.405 1:868. 1:310 
0.750  —1.410 —1.377  —1.34 
0.700 —1.983 узы =] Sih Yao i 
OLO EAST 1928 шат одо иа 
0.780 —1.333  —1.303  —1:269 1:939 
ОСТЕО 0 313810. .—1:980. E cL: y ogy 
0.800 —1.288 —1.259 —1.2 
0.850 —1.192 1167 1:227 —1.181 j 
0.900 —1.115  —1.092  — 1.006 о 
0.950 —1.052  —1.030 —1.006 3030 
TODD 3204082 0.909 0.45 n gay 
1.050 —0.951 : = 
1.100 —0.911 ? mir is 
1.150 —0.875 à —0.833 0-841 
1.200 —0.843 : —0.80; 0-808 
1.950 —0.815 1 —0.780 m 
1.300  —0.789 5 d ; 
1.350 —0.765 > 0:795 0.28 
1.400 —о.744 Г TS 05006 
1.450 —0.724 К =01602 9.088 
1.500 —0.705 й 8707675 me de 
ALMOST UNBIASED RATIO ESTIMATES BASED ON INTERPENETRATING
SUB-SAMPLE ESTIMATES


By M. N. MURTHY
and
N. S. NANJAMMA


Indian Statistical Institute, Calcutta


SUMMARY. In this paper a technique is developed to estimate the bias of an ordinary ratio estimate to a given degree of approximation on the basis of the interpenetrating sub-sample estimates. This estimate of the bias is used to correct the ratio estimate for its bias, thereby obtaining an almost unbiased ratio estimate.


1. INTRODUCTION

In large scale sample surveys, the method of ratio estimation is used to estimate various ratios. It is also used to estimate totals where supplementary information is available, since under certain circumstances usually met with, it is more efficient than the conventional methods of obtaining unbiased estimates. But a satisfactory treatment of the bias and error of a ratio estimate is not yet available. However, different sampling and estimation procedures have been given which provide unbiased ratio estimates. In this paper, two different types of ratio estimates based on estimates obtained from $n$ independent and interpenetrating sub-samples have been compared from the points of view of bias (to the second degree of approximation) and mean square error (to the fourth degree of approximation). This study helps in obtaining an estimate of the bias of the ratio estimate, for any probability sampling design. Once the bias is estimated, the ratio estimate can be corrected to give an unbiased ratio estimate (unbiased to the second degree of approximation). The gain in precision of this unbiased ratio estimate as compared with the biased one has been studied.

The above results can be generalised to estimate the bias of a ratio estimate to any degree of approximation, using a series of ratio estimates based on a number of independent and interpenetrating sub-samples. These generalised results, for the particular case where the estimates of the variates in question are distributed in the bivariate normal form, are given in sections 7 and 8 of this paper.


2. APPROXIMATIONS FOR THE BIAS AND MEAN SQUARE ERROR OF A RATIO ESTIMATE

Let $\hat y$ and $\hat x$ be unbiased estimates of $y$ and $x$, the population totals of two characteristics, based on any probability sample. $\hat y/\hat x$ can be considered as an estimate of the ratio $R = y/x$. This estimate is consistent but biased. Assuming that $\left|\dfrac{\hat x - x}{x}\right| < 1$ and neglecting terms of degree greater than two in the expansion of $\left(1 + \dfrac{\hat x - x}{x}\right)^{-1}$, it can be shown that the bias of $\hat y/\hat x$ is

$$B(\hat y/\hat x) = \frac{1}{x^2}\left(R\,\mathrm{var}(\hat x) - \mathrm{cov}(\hat x, \hat y)\right). \qquad (2.1)$$
The mean square error of $\hat y/\hat x$, to the fourth degree of approximation, is

$$M(\hat y/\hat x) = R^2\left\{\frac{\mu_{02}}{y^2} - \frac{2\mu_{11}}{xy} + \frac{\mu_{20}}{x^2} + 2\left(\frac{2\mu_{21}}{x^2y} - \frac{\mu_{12}}{xy^2} - \frac{\mu_{30}}{x^3}\right) + 3\left(\frac{\mu_{22}}{x^2y^2} - \frac{2\mu_{31}}{x^3y} + \frac{\mu_{40}}{x^4}\right)\right\} \qquad (2.2)$$

where $\mu_{ij} = E\left[(\hat x - x)^i(\hat y - y)^j\right]$.

If the sample size is fairly large, the assumption $\left|\dfrac{\hat x - x}{x}\right| < 1$ can be considered to be valid. Further, $x$ usually denotes the number of persons, or households, or some such characteristic for which we expect reliable estimates with a good design.

For simple random sampling, a large number of empirical studies have shown that generally if the sample size is greater than 30, the assumption that $\left|\dfrac{\hat x - x}{x}\right| < 1$ is valid, and that the contribution of the higher degree terms to the bias and variance of the ratio estimate will be negligible.


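3. COMPARISON OF TWO DIFFERENT RATIO ESTIMATES

Let $(y_i, x_i)$ be unbiased estimates of the population totals $y$ and $x$, from the $i$-th independent interpenetrating sub-sample $(i = 1, 2, \ldots, n)$. The following two ratio estimates can be considered as estimates of $R = y/x$:

(i) $R_1 = \dfrac{y_1 + y_2 + \ldots + y_n}{x_1 + x_2 + \ldots + x_n}$

(ii) $R_n = \dfrac{1}{n}\sum_{i=1}^{n} \dfrac{y_i}{x_i}$

Applying result (2.1) to $R_1 = \dfrac{\sum y_i/n}{\sum x_i/n}$, we get the bias of $R_1$,

$$B(R_1) = B_1 = \frac{1}{x^2}\left\{R\,\mathrm{var}\left(\frac{\sum x_i}{n}\right) - \mathrm{cov}\left(\frac{\sum x_i}{n}, \frac{\sum y_i}{n}\right)\right\} = \frac{1}{n^2x^2}\sum_{i=1}^{n}\left\{R\,\mathrm{var}(x_i) - \mathrm{cov}(x_i, y_i)\right\} \qquad (3.1)$$

since $\mathrm{cov}(x_i, y_j) = 0$ for $i \ne j$.

$$\text{Bias of } R_n,\quad B(R_n) = B_n = \frac{1}{n}\sum_{i=1}^{n} B\left(\frac{y_i}{x_i}\right) = \frac{1}{nx^2}\sum_{i=1}^{n}\left\{R\,\mathrm{var}(x_i) - \mathrm{cov}(x_i, y_i)\right\}. \qquad (3.2)$$

Comparing (3.1) and (3.2), we note that the bias of $R_n$ is $n$ times that of $R_1$, to the second degree of approximation.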
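We now compare the mean square errors of $R_1$ and $R_n$ to the fourth degree of approximation, assuming that the sub-sample sizes are the same (as is the case generally) so that

$$\mu_{rs}(x_i, y_i) = \mu_{rs}, \qquad B\left(\frac{y_i}{x_i}\right) = B \qquad\text{and}\qquad M\left(\frac{y_i}{x_i}\right) = M \qquad\text{for all } i.$$

By applying result (2.2) to $R_1$ and simplifying, we obtain

$$M(R_1) = M_1 = R^2\left\{\frac{1}{n}\left(\frac{\mu_{02}}{y^2} + \frac{\mu_{20}}{x^2} - \frac{2\mu_{11}}{xy}\right) + \frac{2}{n^2}\left(\frac{2\mu_{21}}{x^2y} - \frac{\mu_{30}}{x^3} - \frac{\mu_{12}}{xy^2}\right) + \frac{3(n-1)}{n^3}\left(\frac{3\mu_{20}^2}{x^4} + \frac{\mu_{20}\mu_{02} + 2\mu_{11}^2}{x^2y^2} - \frac{6\mu_{20}\mu_{11}}{x^3y}\right) + \frac{3}{n^3}\left(\frac{\mu_{40}}{x^4} + \frac{\mu_{22}}{x^2y^2} - \frac{2\mu_{31}}{x^3y}\right)\right\} = \frac{M}{n} - \frac{n-1}{n^2}A \qquad (3.3)$$

where

$$A = R^2\left[2\left(\frac{2\mu_{21}}{x^2y} - \frac{\mu_{30}}{x^3} - \frac{\mu_{12}}{xy^2}\right) + \frac{3(n+1)}{n}\left(\frac{\mu_{40}}{x^4} + \frac{\mu_{22}}{x^2y^2} - \frac{2\mu_{31}}{x^3y}\right) - \frac{3}{n}\left(\frac{3\mu_{20}^2}{x^4} + \frac{\mu_{20}\mu_{02} + 2\mu_{11}^2}{x^2y^2} - \frac{6\mu_{20}\mu_{11}}{x^3y}\right)\right],$$

and

$$M(R_n) = M_n = E\left[\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{x_i} - R\right]^2 = \frac{1}{n^2}\sum_{i=1}^{n} E\left(\frac{y_i}{x_i} - R\right)^2 + \frac{1}{n^2}\sum_{i \ne j} B\left(\frac{y_i}{x_i}\right)B\left(\frac{y_j}{x_j}\right) = \frac{M}{n} + \frac{n-1}{n}B^2. \qquad (3.4)$$

From (3.3) and (3.4),

$$M_n = M_1 + \frac{n-1}{n^2}A + \frac{n-1}{n}B^2.$$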
Comparison of $M_n$ and $M_1$ is difficult in general. Hence it is assumed that $\hat x$ and $\hat y$ are distributed in the bivariate normal form, in which case the bias and mean square error of $\hat y/\hat x$ reduce to

$$B = Rc_x(c_x - \rho c_y)$$

$$M = R^2\left[(c_x^2 - 2\rho c_xc_y + c_y^2)(1 + 3c_x^2) + 6c_x^2(c_x - \rho c_y)^2\right].$$

Further $A = 3R^2c_x^2\left\{(c_x^2 - 2\rho c_xc_y + c_y^2) + 2(c_x - \rho c_y)^2\right\}$, which is $\ge 0$,

where $c_x^2 = \dfrac{V(\hat x)}{x^2}$, $c_y^2 = \dfrac{V(\hat y)}{y^2}$,

and $\rho$ = correlation coefficient between $\hat x$ and $\hat y$.

The mean square error of $R_n$ is greater than that of $R_1$. Thus $R_1$ is better than $R_n$ from the considerations of both bias and mean square error.
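The relation $M_n = M_1 + \frac{n-1}{n^2}A + \frac{n-1}{n}B^2$ can be checked numerically in the bivariate normal case. The sketch below assumes the reduced forms of $B$, $M$ and $A$ quoted above, takes $c_x$, $c_y$ as the coefficients of variation of one sub-sample estimate, and obtains $M_1$ by dividing the squared coefficients of variation by $n$ (pooling $n$ equal, independent sub-samples). The function name and the numerical inputs are illustrative only.

```python
import math

def normal_case(R, cx, cy, rho, n):
    """Return (B, M, A, M1, Mn) under bivariate normality.
    cx, cy: coefficients of variation for ONE sub-sample estimate."""
    D = cx * cx - 2 * rho * cx * cy + cy * cy
    B = R * cx * (cx - rho * cy)                      # bias, one sub-sample
    M = R * R * (D * (1 + 3 * cx * cx) + 6 * cx * cx * (cx - rho * cy) ** 2)
    A = 3 * R * R * cx * cx * (D + 2 * (cx - rho * cy) ** 2)
    # R_1 uses the pooled totals: squared CVs shrink by a factor n
    cx1, cy1 = cx / math.sqrt(n), cy / math.sqrt(n)
    D1 = cx1 * cx1 - 2 * rho * cx1 * cy1 + cy1 * cy1
    M1 = R * R * (D1 * (1 + 3 * cx1 * cx1) + 6 * cx1 * cx1 * (cx1 - rho * cy1) ** 2)
    # R_n is the mean of n independent estimates, each with bias B and variance M - B^2
    Mn = (M - B * B) / n + B * B
    return B, M, A, M1, Mn

B, M, A, M1, Mn = normal_case(R=2.0, cx=0.10, cy=0.12, rho=0.5, n=4)
difference = Mn - M1
predicted = (4 - 1) / 4 ** 2 * A + (4 - 1) / 4 * B * B
```

With these inputs `difference` and `predicted` agree to rounding error, and `Mn` exceeds `M1`, in line with the conclusion of this section.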


4. ESTIMATION OF THE BIAS OF THE RATIO ESTIMATE

An unbiased estimate of the bias of the ratio estimate, to the second degree of approximation, is given below.

$$E(R_1) = R + B_1$$
$$E(R_n) = R + B_n$$
$$\therefore\ E(R_n - R_1) = B_n - B_1.$$

But $B_n = nB_1$,

$$\therefore\ E(R_n - R_1) = (n-1)B_1.$$

$\hat B_1 = \dfrac{R_n - R_1}{n-1}$ is an unbiased estimate of $B_1$, the bias of $R_1$ to the second degree of approximation.
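As a small illustration of the estimate $\hat B_1 = (R_n - R_1)/(n-1)$, the sketch below computes $R_1$, $R_n$, the bias estimate and the bias-corrected value $R_1 - \hat B_1 = (nR_1 - R_n)/(n-1)$ from a list of sub-sample totals. The function name and the two-sub-sample data are made up for the example.

```python
def ratio_estimates(ys, xs):
    """ys, xs: sub-sample estimates of the totals y and x (one pair per sub-sample).
    Returns (R_1, R_n, estimated bias of R_1, bias-corrected estimate)."""
    n = len(ys)
    r1 = sum(ys) / sum(xs)                            # ratio of pooled totals
    rn = sum(y / x for y, x in zip(ys, xs)) / n       # mean of sub-sample ratios
    bias_hat = (rn - r1) / (n - 1)                    # estimate of B_1
    corrected = r1 - bias_hat                         # equals (n*r1 - rn)/(n - 1)
    return r1, rn, bias_hat, corrected

r1, rn, bias_hat, corrected = ratio_estimates([10.0, 12.0], [2.0, 3.0])
# r1 = 22/5 = 4.4, rn = (5 + 4)/2 = 4.5, bias_hat = 0.1, corrected = 4.3
```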


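The variance of the estimate of bias of $R_1$ is given by

$$V(\hat B_1) = \frac{1}{(n-1)^2}\left(V_1 + V_n - 2\rho_{R_1R_n}\sqrt{V_1V_n}\right)$$

where $V_1$ = variance of $R_1$ = $M_1 - B_1^2$,

$V_n$ = variance of $R_n$ = $M_n - B_n^2$,

$\rho_{R_1R_n}$ = correlation coefficient of $R_1$ and $R_n$.

$\therefore$ using (3.1), (3.2), (3.3) and (3.4), we get

$$V_1 = V_n + \frac{n-1}{n^2}\left(B^2 - A\right)$$

$$V(\hat B_1) = \frac{V_n}{(n-1)^2}\left(\alpha^2 - 2\rho_{R_1R_n}\alpha + 1\right) \qquad (4.1)$$

where $\alpha^2 = \dfrac{V_1}{V_n} = 1 + \dfrac{n-1}{n^2}\,\dfrac{B^2 - A}{V_n}$.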
If $\hat x$ and $\hat y$ are bivariate normally distributed

$$A - B^2 = 3c_x^2R^2\left[(c_x^2 - 2\rho c_xc_y + c_y^2) + \tfrac{5}{3}(c_x - \rho c_y)^2\right]$$

which is greater than or equal to 0.

It follows that $\alpha < 1$ and therefore $V_1 < V_n$. Thus it may be observed that $R_1$ is a better estimate than $R_n$ from the point of view of bias, mean square error as well as variance.

The expression for the correlation coefficient between $R_1$ and $R_n$ to the fourth degree of approximation is


У [азо деде Ente c (1—p) ] 
i = (ne)? n n 
PR Ra = (11 —5p 42) (lle 5p- 20) 


2. (42) 


under the assumption that
(a) $\hat x$ and $\hat y$ are bivariate normally distributed with the same coefficient of variation
($c_x = c_y = c$) and


(b)! (a2-- aao i) < 9[V($)-4-n 2°]. 
In the above expression $\rho$ stands for the correlation coefficient of $\hat x$ and $\hat y$. If $c$ is small, $\rho_{R_1R_n}$ will be nearly equal to 1.

The coefficient of variation of the estimate of bias may be large; still it may be possible to get a ratio estimate corrected for its bias which is more efficient than the biased one.
5. AN ALMOST UNBIASED RATIO ESTIMATE

Since we have obtained an estimate of bias of $R_1$, that can be used to correct $R_1$ for its bias, and we get an almost unbiased ratio estimate,

$$R_c = \frac{nR_1 - R_n}{n-1}.$$
1 This assumption was necessary to derive the @ (Е.Р), for 


_ &@(вукь)— ERDE Rn) х 
PRıRn (UG) VOS) 


& (RiRn) & жї раша... 


2) V (e!) —(7(®) tnv?) cov. (е, £’) 


_ Yin? , ق‎ А 


V (a) ne? 


)- (T0) +04?) 
(709) tne). 


e = (024501081 eee ryan. 
where 
ғ = (22-4-2122 vee Жалап) 
We say it is an 'almost' unbiased estimate because it is unbiased only to the second degree of approximation. The variance of the corrected estimate is

$$V(R_c) = \frac{1}{(n-1)^2}\left(n^2V_1 + V_n - 2n\rho_{R_1R_n}\sqrt{V_1V_n}\right) = \frac{V_n}{(n-1)^2}\left(n^2\alpha^2 - 2n\rho_{R_1R_n}\alpha + 1\right). \qquad (5.1)$$

The gain in precision due to using $R_c$ instead of $R_1$ is given by

$$G(R_c) = \frac{M_1 - V(R_c)}{M_1} = 1 - \frac{n^2\alpha^2 - 2n\rho_{R_1R_n}\alpha + 1}{(n-1)^2(\alpha^2 + z^2)} \qquad (5.2)$$

where $z^2 = \dfrac{B_1^2}{V_n} = \dfrac{B^2}{nV}$,

$B$ and $V$ being the bias and variance of the ratio estimate based on one sub-sample. It may be noted that

$$\frac{|B|}{\sqrt{V}} < c_x$$

where $c_x$ is the coefficient of variation of the estimate $\hat x$ based on one sub-sample. If the sub-sample size is large, $c_x$ will be small. Hence $z^2$ can be neglected. It is to be noted that neglecting $z^2$ does not amount to neglecting bias. The gain in precision can be written as

$$G(R_c) = 1 - \frac{n^2\alpha^2 - 2n\rho_{R_1R_n}\alpha + 1}{(n-1)^2\alpha^2}. \qquad (5.3)$$

Further the expression in (5.2) is greater than that in (5.3).

$G(R_c) > 0$, if $(n-1)^2\alpha^2 - (n^2\alpha^2 - 2n\rho_{R_1R_n}\alpha + 1) > 0$,

i.e. if $(2n-1)\alpha^2 - 2n\rho_{R_1R_n}\alpha + 1 < 0$,

which will be true if $\alpha$ lies between the roots of the equation

$$(2n-1)\alpha^2 - 2n\rho_{R_1R_n}\alpha + 1 = 0 \qquad (5.4)$$

(i.e.) if $\alpha$ lies between $\dfrac{n\rho_{R_1R_n} \pm \left(n^2\rho_{R_1R_n}^2 - 2n + 1\right)^{1/2}}{2n-1}$.

For given values of $\alpha$ and $\rho_{R_1R_n}$, the minimum value of $n$ which makes $G(R_c) > 0$ is the smallest integer satisfying

$$n > \frac{1 - \alpha^2}{2\alpha(\rho_{R_1R_n} - \alpha)}. \qquad (5.5)$$
It can be seen that $G(R_c)$ will be positive only if $\rho_{R_1R_n} > \alpha$. Further, for given values of $\alpha$ and $\rho_{R_1R_n}$, where $\rho_{R_1R_n} > \alpha$, the value of $n$ which maximises the gain is

$$n = \frac{1 - \alpha\rho_{R_1R_n}}{\alpha(\rho_{R_1R_n} - \alpha)}. \qquad (5.6)$$

For given values of $n$ and $\rho_{R_1R_n}$ the value of $\alpha$ which maximises the gain is

$$\alpha = \frac{1}{n\rho_{R_1R_n}}.$$

A table showing, for given values of $\rho_{R_1R_n}$ and $\alpha$, the minimum value of $n$ required to make the gain positive, the optimum $n$ and the maximum gain is given below.
MINIMUM AND MAXIMUM VALUES OF G(R_c) WITH THE CORRESPONDING VALUES OF n FOR DIFFERENT ρ(R_1, R_n) AND α WHERE ρ(R_1, R_n) > α

                                 minimum             maximum
 sl.      α     ρ(R_1,R_n)     n     G(R_c)        n     G(R_c)
 (0)     (1)       (2)        (3)     (4)         (5)     (6)
  1      0.6       0.7         6     0.0089       10     0.0192
  2      0.6       0.8         3     0.0556        4     0.0988
  3      0.6       0.9         2     0.0889        3     0.3056
  4      0.7       0.8         4     0.0113        7     0.0266
  5      0.7       0.9         2     0.1020        3     0.1684
  6      0.8       0.9         3     0.0469        4     0.0486
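The entries of such a table can be explored with a few lines of code. The sketch below assumes the approximate gain $G(R_c) = 1 - \dfrac{n^2\alpha^2 - 2n\rho\alpha + 1}{(n-1)^2\alpha^2}$, with $\alpha = \sqrt{V_1/V_n}$ and $\rho$ the correlation between $R_1$ and $R_n$; the function names and the search limit are choices made for this illustration.

```python
def gain(n, alpha, rho):
    """Approximate gain in precision of the corrected estimate (z^2 neglected)."""
    num = n * n * alpha * alpha - 2 * n * rho * alpha + 1
    return 1 - num / ((n - 1) ** 2 * alpha * alpha)

def min_n(alpha, rho, n_max=1000):
    """Smallest n >= 2 for which the gain is positive (None if not found)."""
    for n in range(2, n_max + 1):
        if gain(n, alpha, rho) > 0:
            return n
    return None

def n_continuous_optimum(alpha, rho):
    """Stationary point of the gain treated as a function of real n."""
    return (1 - alpha * rho) / (alpha * (rho - alpha))

first_row = (min_n(0.6, 0.7), gain(6, 0.6, 0.7), gain(10, 0.6, 0.7))
```

For $\alpha = 0.6$ and $\rho = 0.7$ this gives a minimum $n$ of 6 with gain about 0.0089, and a gain of about 0.0192 at $n = 10$, which agrees with the first row of the table; the continuous optimum is about 9.67, so the integer optimum has to be found by comparing neighbouring values of $n$.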


6. EMPIRICAL STUDY

In section 5 we have discussed the efficiency of the corrected estimate $R_c$ as compared to that of $R_1$. There it has been pointed out that under certain circumstances $R_c$ will be a better estimate of $R$ than $R_1$. In this section we give an example where the variance of $R_c$ turned out to be less than the mean square error of $R_1$.

The data for this study consist of the village-wise figures for the number of households and the number of persons attending the village market for a sample of 300 villages scattered over a wide region. Treating this as the population, two samples of size 30 villages are drawn systematically with independent random starts to estimate the number of persons going to market per household. From these two sub-samples the estimates $R_1$ and $R_n$ are calculated. Then the corrected estimate is found by estimating the bias of $R_1$.

For the purpose of this study, all the possible pairs of systematic samples are enumerated. Then for each of the pairs, $R_1$, $R_n$ and $R_n - R_1$ are calculated. The variance of $R_c$, the corrected estimate, and the mean square error of $R_1$ are determined. The results are given below.

Population ratio = 7.6857

$E(R_1) = 7.9211$ and $E(R_n) = 8.1401$

$B(R_1) = 0.2354$ and $B(R_n) = 0.4544$

It is to be noted that $B_1$ is almost half of $B_n$. This may be taken as indicating that second degree approximation is good enough. The variance of $R_c$ and mean square error of $R_1$ are

$V(R_c) = 8.9992$ and $M(R_1) = 9.6144$

$\rho_{R_1R_n} = 0.9856$ and $\alpha = 0.8871$

$G(R_c) = 6.4\%$
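The reported figures can be cross-checked with simple arithmetic. The snippet below only re-derives the biases and the gain from the numbers quoted above (population ratio, expectations, variance and mean square error); the variable names are illustrative.

```python
population_ratio = 7.6857
e_r1, e_rn = 7.9211, 8.1401          # expected values of R_1 and R_n
v_rc, m_r1 = 8.9992, 9.6144          # variance of R_c, mean square error of R_1

bias_r1 = e_r1 - population_ratio    # 0.2354
bias_rn = e_rn - population_ratio    # 0.4544
bias_ratio = bias_rn / bias_r1       # close to n = 2, as the second-degree theory suggests
gain_percent = 100 * (m_r1 - v_rc) / m_r1   # about 6.4
```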


7. COMPARISON OF A SERIES OF RATIO ESTIMATES

When $n$, the number of independent and interpenetrating sub-samples, is a multiple of 2, 3, ... and $k$, we can construct the following series of ratio estimates,

$$R_m = \frac{1}{m}\left[\frac{\sum_{i=1}^{n/m} y_i}{\sum_{i=1}^{n/m} x_i} + \frac{\sum_{i=n/m+1}^{2n/m} y_i}{\sum_{i=n/m+1}^{2n/m} x_i} + \ldots + \frac{\sum_{i=(m-1)n/m+1}^{n} y_i}{\sum_{i=(m-1)n/m+1}^{n} x_i}\right]$$

where $m = 1, 2, 3, \ldots, k, n$.

It may be noted that there are $\dfrac{n!}{m!\left[(n/m)!\right]^m}$ different ways of partitioning the $n$ sub-samples into $m$ partitions each containing $n/m$ sub-samples.

In practice the situation may arise where $(\hat x, \hat y)$ are approximately distributed in the bivariate normal form. In such a case we may make use of the properties of the bivariate normal distribution for writing down the expressions for the bias and mean square error of the ratio estimate. It is to be noted that the infinite series for the bias and mean square error are divergent. As has been rightly pointed out by Cramér and Kendall in their books, in statistical practice one is interested not so much in the convergence properties of the infinite series representing a function but in finding out whether the first few terms of that series will give a good approximation to the function.

Naturally in a finite population where the estimate $\hat x$ does not take the value 0, the bias and the mean square error of the ratio estimate $\hat y/\hat x$ will be finite quantities. The formal expressions for the bias and mean square error in terms of infinite series under the assumption of bivariate normality are considered here. The problem as to how many terms are to be taken to obtain a desired degree of approximation in different situations is yet to be fully investigated. So in discussing the bias, only terms of degree greater than $2k$ are neglected, where $k$ is any finite number.

Bias of $\hat y/\hat x$ is given by

$$B(\hat y/\hat x) = R(c_x - \rho c_y)\sum_{j=1}^{k}\frac{(2j)!}{2^j\,j!}\,c_x^{2j-1} = \sum_{j=1}^{k} A_j \qquad (7.1)$$

where $A_j = R(c_x - \rho c_y)\,\dfrac{(2j)!}{2^j\,j!}\,c_x^{2j-1}$.

Mean square error of $\hat y/\hat x$ to the fourth degree of approximation is given by

$$M(\hat y/\hat x) = R^2\left[(c_x^2 - 2\rho c_xc_y + c_y^2)(1 + 3c_x^2) + 6c_x^2(c_x - \rho c_y)^2\right]. \qquad (7.2)$$

$$\text{Bias of } R_m,\quad B(R_m) = B_m = \sum_{j=1}^{k}\left(\frac{m}{n}\right)^j A_j = \sum_{j=1}^{k} m^j\eta_j \qquad (7.3)$$

where $\eta_j = \dfrac{A_j}{n^j}$, the coefficients of variation in $A_j$ being those of the estimates based on one sub-sample.

From (7.3) it follows that the biases of the series of ratio estimates have the same sign, and the absolute value of the bias increases with $m$. From (7.2) the mean square error of $R_m$ to the fourth degree of approximation is given by

$$M(R_m) = M_m = \frac{R^2}{n}\left(c_x^2 - 2\rho c_xc_y + c_y^2\right) + \frac{m}{n^2}A + \frac{m(m-1)}{n^2}B^2$$

where $A$, $B$ and $M$ have been defined in section 3. Since $A > 0$, $M_m$ is an increasing function of $m$.

Further $M_m - M_{m-1} = \dfrac{A}{n^2} + \dfrac{2(m-1)}{n^2}B^2$, which also increases as $m$ increases.
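The bias series can be evaluated directly. The following sketch assumes the bivariate normal terms $A_j = R(c_x - \rho c_y)\frac{(2j)!}{2^j j!}c_x^{2j-1}$ (with $c_x$, $c_y$ referring to one sub-sample) and the relation $B_m = \sum_j (m/n)^j A_j$; the function names and inputs are illustrative.

```python
from math import factorial

def a_term(j, R, cx, cy, rho):
    """j-th term of the bias series for a ratio based on one sub-sample."""
    coeff = factorial(2 * j) / (2 ** j * factorial(j))   # 1, 3, 15, ... = (2j-1)!!
    return R * (cx - rho * cy) * coeff * cx ** (2 * j - 1)

def bias_rm(m, n, k, R, cx, cy, rho):
    """Bias of R_m to the 2k-th degree of approximation."""
    return sum((m / n) ** j * a_term(j, R, cx, cy, rho) for j in range(1, k + 1))

R, cx, cy, rho, n = 2.0, 0.10, 0.12, 0.5, 6
biases = [bias_rm(m, n, 2, R, cx, cy, rho) for m in (1, 2, 3, 6)]
```

With $k = 1$ the expression reduces to $mB/n$, where $B = Rc_x(c_x - \rho c_y)$, and for the inputs above the values in `biases` share one sign and grow in absolute value with $m$.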
8. ESTIMATION OF THE BIAS OF A SERIES OF RATIO ESTIMATES

The bias of $R_m$ to the $(2k)$-th degree of approximation can be estimated as given below, from $n$ independent and interpenetrating sub-samples, provided¹ $n$ is a multiple of 2, 3, 4, ... and $k$, where $\hat x$ and $\hat y$ are distributed in the bivariate normal form. (The bias, when $\hat x$ and $\hat y$ are not bivariate normally distributed, can also be estimated by adopting a similar procedure.)

The bias of $R_m$ to the $2k$-th degree of approximation is given by

$$B_m = \sum_{j=1}^{k} m^j\eta_j, \qquad m = 1, 2, \ldots, k, n \quad [\text{see } (7.3)] \qquad (7.4)$$

$$E(R_m) = R + B_m$$

$$\therefore\ E(R_m - R_1) = B_m - B_1 = \sum_{j=1}^{k}(m^j - 1)\eta_j. \qquad (7.5)$$

Let $D_m = R_m - R_1$.

From (7.5), $E(D) = \eta\,\Lambda$, the dimensions being $(1 \times k) = (1 \times k)(k \times k)$,

where $D = (D_2, D_3, \ldots, D_k, D_n)$,

$\eta = (\eta_1, \eta_2, \ldots, \eta_k)$,

$$\Lambda = \begin{pmatrix} 2 - 1 & 3 - 1 & \cdots & k - 1 & n - 1 \\ 2^2 - 1 & 3^2 - 1 & \cdots & k^2 - 1 & n^2 - 1 \\ \vdots & \vdots & & \vdots & \vdots \\ 2^k - 1 & 3^k - 1 & \cdots & k^k - 1 & n^k - 1 \end{pmatrix}.$$

$$\therefore\ \eta = E(D)\,\Lambda^{-1}.$$

From (7.4), $B = \eta\,(\Lambda + e)$,

where $B = (B_2, B_3, \ldots, B_k, B_n)$

and $e$ is a $(k \times k)$ matrix whose elements are all equal to 1.

$$\therefore\ B = E[D] + E[D]\,\Lambda^{-1}e.$$

But

$$\Lambda^{-1}e = \begin{pmatrix} s_1 & s_1 & \cdots & s_1 \\ \vdots & \vdots & & \vdots \\ s_k & s_k & \cdots & s_k \end{pmatrix}$$

where $s_m$ is the sum of the elements in the $m$-th row of $\Lambda^{-1}$.

An estimate of $B$ is $\hat B = D + D\Lambda^{-1}e = D + D\,[s]\,(1)$,

where $[s] = (s_1, s_2, \ldots, s_k)'$ and $(1) = (1, 1, \ldots, 1)$.

$$\hat B_m = \sum_j D_js_j + D_m, \qquad\text{where } j = 2, 3, \ldots, k, n,$$

each $D_j$ being multiplied by the row sum of $\Lambda^{-1}$ in the corresponding position.

Hence the corrected estimate is given by $R_c = R_m - \hat B_m$.

¹ This condition is only sufficient but not necessary. The estimate of the bias and the corrected estimate may be obtained even if $n$ is not a multiple of 2, 3, ..., $k$, by considering the series of ratio estimates defined over over-lapping partitions of the $n$ sub-samples.
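Particular cases:

(1) $R_1 = \dfrac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}$ and $R_n = \dfrac{1}{n}\sum_{i=1}^{n}\dfrac{y_i}{x_i}$.

$B_m = m\eta_1 = \dfrac{m}{n}A_1$, where $m = 1, n$.

Since $\Lambda = (n - 1)$, $\Lambda^{-1} = \dfrac{1}{n-1}$ and

$$\hat B_n = (R_n - R_1) + \frac{1}{n-1}(R_n - R_1) = \frac{n}{n-1}(R_n - R_1).$$

But $B_n - B_1 = E(R_n - R_1)$,

$$\therefore\ \hat B_1 = \hat B_n - (R_n - R_1) = \frac{1}{n-1}(R_n - R_1).$$

(2) $R_1 = \dfrac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}$, $\quad R_2 = \dfrac{1}{2}\left[\dfrac{\sum_{i=1}^{n/2} y_i}{\sum_{i=1}^{n/2} x_i} + \dfrac{\sum_{i=n/2+1}^{n} y_i}{\sum_{i=n/2+1}^{n} x_i}\right]$ and $R_n = \dfrac{1}{n}\sum_{i=1}^{n}\dfrac{y_i}{x_i}$,

where $n$ is a multiple of 2.

Here $B_m = m\eta_1 + m^2\eta_2 = \dfrac{m}{n}A_1 + \dfrac{m^2}{n^2}A_2$; $m = 1, 2$ and $n$.

$$\Lambda = \begin{pmatrix} 1 & n - 1 \\ 3 & n^2 - 1 \end{pmatrix}, \qquad \Lambda^{-1} = \frac{1}{(n-1)(n-2)}\begin{pmatrix} n^2 - 1 & -(n-1) \\ -3 & 1 \end{pmatrix},$$

$$s_1 = \frac{n}{n-2}, \qquad s_2 = \frac{-2}{(n-1)(n-2)}.$$

$$(\hat B_2, \hat B_n) = (R_2 - R_1,\ R_n - R_1) + \left[s_1(R_2 - R_1) + s_2(R_n - R_1)\right](1, 1)$$

$$\hat B_2 = (R_2 - R_1) + \frac{n}{n-2}(R_2 - R_1) - \frac{2(R_n - R_1)}{(n-1)(n-2)} = \frac{-2n}{n-1}R_1 + \frac{2(n-1)}{n-2}R_2 - \frac{2R_n}{(n-1)(n-2)},$$

similarly

$$\hat B_n = \frac{-2n}{n-1}R_1 + \frac{n}{n-2}R_2 + \frac{n(n-3)}{(n-1)(n-2)}R_n$$

and

$$\hat B_1 = -\frac{n+1}{n-1}R_1 + \frac{nR_2}{n-2} - \frac{2R_n}{(n-1)(n-2)}.$$

Hence the almost unbiased estimate in this case is

$$R_c = \frac{2nR_1}{n-1} - \frac{nR_2}{n-2} + \frac{2R_n}{(n-1)(n-2)}.$$

It may be noted that the results given above will also be obtained when an estimate of the bias to the third degree of approximation is considered, in the case when $\hat x$ and $\hat y$ are not distributed in the bivariate normal form.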
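Particular case (2) can be verified with exact rational arithmetic. The sketch below assumes $\Lambda = \begin{pmatrix} 1 & n-1 \\ 3 & n^2-1 \end{pmatrix}$, forms the row sums of its inverse, and compares $R_1 - \hat B_1$ with the closed form $\frac{2nR_1}{n-1} - \frac{nR_2}{n-2} + \frac{2R_n}{(n-1)(n-2)}$. The function names and the four integer-valued sub-sample totals are invented for the example.

```python
from fractions import Fraction as Fr

def series_estimates(ys, xs):
    """R_1, R_2 and R_n from n sub-sample totals (n even, integer inputs)."""
    n, half = len(ys), len(ys) // 2
    r1 = Fr(sum(ys), sum(xs))
    r2 = (Fr(sum(ys[:half]), sum(xs[:half])) + Fr(sum(ys[half:]), sum(xs[half:]))) / 2
    rn = sum(Fr(y, x) for y, x in zip(ys, xs)) / n
    return r1, r2, rn

def row_sums(n):
    """Row sums s_1, s_2 of the inverse of Lambda = [[1, n-1], [3, n^2-1]]."""
    det = Fr((n - 1) * (n - 2))
    inverse = [[Fr(n * n - 1) / det, Fr(-(n - 1)) / det],
               [Fr(-3) / det, Fr(1) / det]]
    return [sum(row) for row in inverse]

def corrected(ys, xs):
    """R_1 minus the estimated bias B_1 = D_2 s_1 + D_n s_2."""
    n = len(ys)
    r1, r2, rn = series_estimates(ys, xs)
    s1, s2 = row_sums(n)
    return r1 - ((r2 - r1) * s1 + (rn - r1) * s2)

ys, xs = [10, 12, 9, 11], [2, 3, 2, 3]
n = len(ys)
r1, r2, rn = series_estimates(ys, xs)
closed_form = Fr(2 * n, n - 1) * r1 - Fr(n, n - 2) * r2 + Fr(2, (n - 1) * (n - 2)) * rn
```

For $n = 4$ the row sums are $2$ and $-\frac13$, and `corrected(ys, xs)` equals `closed_form` exactly.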
PRECISION IN THE CONSTRUCTION OF COST OF
LIVING INDEX NUMBERS


By K. S. BANERJEE


State Statistical Bureau, West Bengal


SUMMARY. In the construction of cost of living index (consumers' price index) numbers, consumption items are usually grouped into composite commodities. A criterion has been developed showing how best the grouping of items could be done. That non-judicious grouping might lead to serious errors has also been demonstrated.


1. INTRODUCTION 


Although much attention has been drawn to the problem of securing the True Index
(TI) in the context of constructing Cost of Living Index (CLI) numbers, and formulae evolved
[Banerjee (1956e), Frisch (1936), Konüs (1939) and Wald (1939)] for the purpose of constructing
the True Index, these formulae do not appear to have been used much in practice.
In actual practice, however, Laspeyres' base-weighted formula continues to be widely used
for approximating the CLI, although it is known to over-estimate the index.


2. PRECISION NEGLECTED

Whereas it was necessary to construct the True Index in the precise estimation of CLI, and whereas, instead, Laspeyres' formula is being used at the cost of precision, it would, at least, only be reasonable to make sure that Laspeyres' Index be precisely calculated. This aspect of precision does not appear to have been paid the attention it deserves, so much so that it sometimes causes an embarrassment, when different organisations, while calculating the CLI for the same area and the same economic stratum of population, come out with different figures for the same index. Difference in the figures for the same index could have been appreciated, if the coverage (the sample, or the way the sample is selected) and the error of estimation were made available. In absence of such information, controversies arise causing difficulty at administrative levels. With a view to systematising the study, the concept of standard error in index number calculation was introduced in an earlier note (Banerjee, 1956a) where it was shown that it would be possible to calculate the standard error for an estimated CLI under certain assumptions.¹


3. PURPOSE OF THIS NOTE

The purpose of this note is to show the extent of error which might creep in through a non-judicious computation of Laspeyres' Index and to suggest measures of precaution to be taken for precision. It is also to demonstrate incidentally some principles which would serve as a guide for calculating Laspeyres' Index on minimum price collection and minimum computation.

The principles which have been demonstrated here will have their application in general in any index number formula, which, or a part of which, is reducible to the form of a weighted average of relatives.

¹ These assumptions have later been generalised leading to the same form of error variance as hereinafter indicated.


4. LASPEYRES' FORMULA 


Laspeyres' formula, $100\sum_{i=1}^{N} p_{1i}q_{0i} \Big/ \sum_{i=1}^{N} p_{0i}q_{0i}$, is usually adopted in routine practice in the reduced form, $\sum_i r_iw_i$, where $r_i\ (= 100\,p_{1i}/p_{0i})$ is the price relative of the $i$-th consumption item expressed in percentage, $w_i\ \left(= p_{0i}q_{0i} \Big/ \sum_{i=1}^{N} p_{0i}q_{0i}\right)$ the weight of the $i$-th item which is known and is determined from family budgets, and $\sum_i w_i = 1$. The consumption items are usually grouped under major groups of consumption, and within each major group the items are again grouped into sub-groups which, in turn, may either be composite commodities or singular items. For each such sub-group, a price relative is obtained, and such price relatives are averaged with the corresponding weights.
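That the reduced form $\sum_i r_iw_i$ reproduces Laspeyres' formula can be seen on a toy example. The prices and base-period quantities below are invented purely for illustration, and the function names are not from the paper.

```python
def laspeyres_direct(p0, p1, q0):
    """100 * sum(p1*q0) / sum(p0*q0)."""
    return 100 * sum(a * q for a, q in zip(p1, q0)) / sum(a * q for a, q in zip(p0, q0))

def laspeyres_from_relatives(p0, p1, q0):
    """Weighted average of price relatives with base-period expenditure weights."""
    total = sum(a * q for a, q in zip(p0, q0))
    weights = [a * q / total for a, q in zip(p0, q0)]       # w_i, summing to 1
    relatives = [100 * b / a for a, b in zip(p0, p1)]       # r_i in percentage
    return sum(r * w for r, w in zip(relatives, weights))

p0 = [2.0, 5.0, 1.0]     # base-period prices (hypothetical)
p1 = [2.5, 4.5, 1.2]     # given-period prices (hypothetical)
q0 = [10.0, 4.0, 20.0]   # base-period quantities (hypothetical)
direct = laspeyres_direct(p0, p1, q0)
reduced = laspeyres_from_relatives(p0, p1, q0)
```

Both routes give the same index (here $100 \times 67/60$), since $r_iw_i = 100\,p_{1i}q_{0i}/\sum p_{0i}q_{0i}$ term by term.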


The calculation of the index is generally completed by two stages. It is first calculated for a major group, and then the indexes for the major groups are combined into the overall index.

Usually, a sub-group is also a composite commodity consisting of numerous, though finite, constituent items. In that case, the calculation has to be extended to the third stage, beginning with the index for such a sub-group (composite commodity).

Without loss of generality, however, the calculation of the index may be considered to be a two-stage one; that is, the index will be completed first for a composite commodity (sub-group), and then the indexes for the composite commodities (sub-groups) will be combined into the overall index. In that case, Laspeyres' formula will take the form,

$$\sum_{i}\sum_{j} r_{ij}w_{ij}, \qquad i = 1, 2, \ldots, g \ \text{ and } \ j = 1, 2, \ldots, N_i \qquad (4.1)$$

where $r_{ij}$ and $w_{ij}$ are respectively the price relative and the weight of the $j$-th constituent item of the $i$-th composite commodity, $\sum_i\sum_j w_{ij} = 1$, $\sum_j w_{ij} = w_i$, $\sum_i w_i = 1$ and $\sum_i N_i = N$.


5. A PRACTICE IN VOGUE IN THE TREATMENT OF COMPOSITE COMMODITY

The weights of the individual constituent items of a composite commodity are not known in practice, as it becomes impracticable to determine the individual weights from a Family Budget Enquiry. What is afforded by a Family Budget Enquiry is the total weight for all the constituent items constituting the composite commodity; that is, $w_i$ is known, but not $w_{ij}$. This absence of knowledge of $w_{ij}$ brings in the difficulty in the precise estimation of the index and, for the matter of that, is responsible for bringing into being divergent practices in the actual computation of the index.

To meet this difficulty of not having the knowledge of $w_{ij}$, the practice is to calculate some sort of a price relative for the composite commodity as a whole and then to have it weighted with the total weight for the entire composite commodity. The price relative used, under one such practice,¹ is the relative of the average prices in the base period, 0, and in the given period, 1, the average price of the composite commodity being defined as the price averaged with the quantities sold of the constituent items in the market.

With such a definition for the average price, the treatment of the composite commodity will be in agreement, subject to sampling fluctuations (Banerjee, 1956b), with the requirement of Laspeyres' Index. But there are certain assumptions involved in the practice which may not be always realisable for all types of composite commodities. The assumptions are:

(i) Acceptance of the definition for the average price of a composite commodity in the way it has been framed above.

(ii) Existence of the same supply pattern (relative supply) both in the base period (0) and the period of comparison (1).

Here relative supply of a constituent item has been taken to mean the proportion of its supply to the total supply of the composite commodity.

Assumption (i) in a given period could be accepted as reasonable, only if the relative supply of the constituent items would remain same during the period. While this might hold good in respect of many composite commodities, there may be some composite commodities where it might be violently wrong to make this assumption. If for those composite commodities, only the cheaper of the constituent items appear, for some reason or other, in the market in any period for sale, the average price of the composite commodity would come out as less than what actually it should be, and vice versa.

If validity of assumption (ii) could be accepted at all, it could perhaps be accepted to hold good during shorter intervals. At least, the probability of the assumption holding good during shorter intervals of time would be higher than that during wider intervals. That it would be so is a limitation of assumption (ii).


¹ Reference to some other practices has been made in an earlier note [Banerjee: Bull. Cal. Stat. Ass., 7, No. 25, 1956, pp. 35-40].


6. A REASONABLE PROCEDURE AND OUTLOOK

A reasonable procedure would be to calculate the price relatives, for some $n_i$ constituent items out of a total of $N_i$ constituting the composite commodity, and to take an arithmetic mean of the $n_i$ price relatives, $r_{ik}$ ($k = 1, 2, \ldots, n_i$), so as to calculate the index as $\sum_{i=1}^{g}\bar r_iw_i$, where $\bar r_i = \dfrac{1}{n_i}\sum_{k=1}^{n_i} r_{ik}$. The implications of this practice may be indicated as follows:


Let $\rho_i$ be the correlation coefficient between $r_{ij}$ and $w_{ij}$, and $\bar w_i = w_i/N_i$. Then, for the $i$-th composite commodity, we have

$$\sum_j r_{ij}w_{ij} = N_i\bar r_i\bar w_i + N_i\rho_i\sigma_{r_i}\sigma_{w_i}$$

or,

$$\sum_j r_{ij}w_{ij} = \bar r_iw_i + N_i\rho_i\sigma_{r_i}\sigma_{w_i} \qquad (6.1)$$

where $\bar r_i = \dfrac{1}{N_i}\sum_{j=1}^{N_i} r_{ij}$. If $\rho_i = 0$, we shall have $\sum_j r_{ij}w_{ij} = \bar r_iw_i$. Under this condition, formula (4.1) would reduce to $\sum_{i=1}^{g}\bar r_iw_i$. The condition, $\rho_i = 0$, therefore, dispenses with the necessity for having to know the individual weights, $w_{ij}$.
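Identity (6.1) is the usual decomposition of a sum of products into a product of means plus a covariance term, and it can be confirmed on any set of numbers. In the sketch below the relatives and weights of one composite commodity are invented, and the standard deviations are taken with divisor $N_i$ so that $N_i\rho_i\sigma_{r_i}\sigma_{w_i}$ equals the sum of cross-products about the means.

```python
def decompose(r, w):
    """Return (sum r*w, r_bar * total weight, N * rho * sigma_r * sigma_w)."""
    N = len(r)
    r_bar = sum(r) / N
    w_bar = sum(w) / N
    # N * cov(r, w) with divisor N, i.e. N * rho * sigma_r * sigma_w
    cross = sum((a - r_bar) * (b - w_bar) for a, b in zip(r, w))
    direct = sum(a * b for a, b in zip(r, w))
    return direct, r_bar * sum(w), cross

r = [102.0, 86.0, 125.0, 95.0]       # hypothetical price relatives r_ij
w = [0.20, 0.05, 0.10, 0.02]         # hypothetical weights w_ij
direct, pooled, correction = decompose(r, w)
```

Here `direct` equals `pooled + correction`; when the relatives and weights are uncorrelated, `correction` vanishes and the pooled weight $w_i$ with the simple mean $\bar r_i$ is sufficient.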


The error variance of the estimate, $\sum_i \bar r_iw_i$, may be derived in the form,

$$\sum_i \frac{w_i^2\sigma_i^2}{n_i} \qquad (6.2)$$

where $\sigma_i^2$ is the variance of $r_{ij}$ within $i$.


Tt has to be remembered in this context that the individual weights of the constituent 
items can be pooled, only if the correlation coefficient between the price relatives and the 
weights of the constituent items is zero. 


Tf p; is not equal to zero, this practice will lead to an erroneous result. Tfp;is either 
positive, or negative for all i’s, the errors will be additive bringing in a wide departure from 
what is being estimated. If, however, some of ргз are positive and some negative, tho 
errors will partly cancell out and, as a result, the magnitude of the added errors will be less. 
If Pi is not ggu to 200, its contribution to the error will be N;p,7,;,7,;, = Асу, where 
biis the regression coefficient of w;j оп 7. Therefore, if each of the p;’s is not zero individually, 

А à 
we should have X N,b;(c;? = 0 so that no error is made. 
del 
It appears, it would not be very unreasonable to assume equality of the variances, 


о?, from composite commodity to composite commodity. In that case, the above condition 
would reduce to 


z 
zt ND = 0. ... (6.3) 


i21 


For small values of $\rho_i$ the error involved may not be much. If, therefore, the composite
commodities could be so taken as to ensure at least a small value for $\rho_i$, the practice
under consideration would be commendable. This practice has a practical advantage in that
it involves a lesser effort than what is required to find a price averaged with quantities sold
in the market.


The illustrations cited below will show how the condition $\rho_i = 0$ may be utilised
with advantage in the construction of the index on minimum effort.




7. NUMERICAL ILLUSTRATIONS 


Table 1 shows the calculation of the index on Food, which has been divided here into
25 composite commodities. Taking the index to have been correctly calculated on these
25 composite commodities, the numerical example may be utilised to show how the
composite commodities could be further grouped so that the index on Food could be
arrived at with a lesser number of composite commodities. Although some of these
suggested groupings, which have been made here on the basis of zero-correlation, may
not be practicable, these groupings have, none-the-less, been shown as illustrations
to point out how the above result could be exploited in computing the index on minimum
effort.


TABLE 1. PRICE RELATIVES OF FOOD ARTICLES AND THEIR WEIGHTS

items                                 price        weights
                                      relatives    (in percentages)
(1)                                   (2)          (3)

 1. rice                              102          27.64
 2. rice products                      86           1.99
 3. wheat and wheat products          125           8.28
 4. other cereals & cereal products    95           0.72
 5. pulses                             91           5.09
 6. edible oils                        72           7.93
 7. vegetable ghee                    102           0.93
 8. salt                               92           0.41
 9. spices                             85           3.93
10. sugar                              93           4.67
11. gur                               123           0.56
12. milk                              101           3.44
13. butter and ghee                    95           0.71
14. other milk products                96           0.63
15. potato                             62           4.13
16. onions                            155           0.69
17. other non-leafy vegetables         57           8.31
18. leafy vegetables                   83           3.47
19. fish                               76           7.54
20. meat                               97           1.74
21. eggs                               80           0.39
22. fruits                            107           1.17
23. tea and coffee                    101           1.34
24. other refreshments & sweets        90           3.20
25. other food articles               116           1.09
    total                                         100.00

Index = 91.43


The correlation coefficient between the 25 price relatives and weights is -0.1390.
The index on Food calculated from the 25 composite commodities is 91.43, while a simple
arithmetic average of the price relatives is 95.28. As there is a negative correlation, the
weighted index is lesser in magnitude than the simple arithmetic average of the price
relatives.
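The three figures quoted above can be recomputed from Table 1. The sketch below is a minimal check using the price relatives and weights as transcribed in this copy, with population standard deviations assumed for the correlation. With these transcribed figures the weighted index comes out at about 91.46, marginally different from the printed 91.43, which may reflect rounding in the original computation or a slip in one of the transcribed weights; the simple average and the correlation agree with the printed values.

```python
from statistics import fmean, pstdev

# (price relative, weight in per cent) for items 1 to 25, as transcribed from Table 1.
TABLE_1 = [
    (102, 27.64), (86, 1.99), (125, 8.28), (95, 0.72), (91, 5.09),
    (72, 7.93), (102, 0.93), (92, 0.41), (85, 3.93), (93, 4.67),
    (123, 0.56), (101, 3.44), (95, 0.71), (96, 0.63), (62, 4.13),
    (155, 0.69), (57, 8.31), (83, 3.47), (76, 7.54), (97, 1.74),
    (80, 0.39), (107, 1.17), (101, 1.34), (90, 3.20), (116, 1.09),
]

relatives = [r for r, _ in TABLE_1]
weights = [w for _, w in TABLE_1]

# Weighted index: price relatives averaged with the percentage weights.
weighted_index = sum(r * w for r, w in TABLE_1) / sum(weights)

# Simple (unweighted) arithmetic average of the price relatives.
simple_average = fmean(relatives)

# Correlation between price relatives and weights.
mean_r, mean_w = simple_average, fmean(weights)
covariance = fmean([(r - mean_r) * (w - mean_w) for r, w in TABLE_1])
rho = covariance / (pstdev(relatives) * pstdev(weights))

print(round(weighted_index, 2), round(simple_average, 2), round(rho, 4))
```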
The 25 price relatives for the 25 composite commodities of Food have been plotted
against the respective weights in the diagram. From the graph it can be readily determined
as to what of these 25 commodities could be further grouped. 5 sets of groupings have been
suggested and the index for each calculated, as shown at the bottom of each set in Table 2.

Such grouping cannot be extended too far. Against this, the magnitude of the error-variance
of the pooling may be a limiting factor.
TABLE 2. DIFFERENT SETS OF GROUPS

number                                          sets
of groups  I                 II                III           IV            V                   VI
(1)        (2)               (3)               (4)           (5)           (6)                 (7)

1          25, 22            25, 22            25, 22        25, 22        25, 22, 7           25, 24, 23, 22
2          24, 23, 20, 19,   24, 20, 18, 15,   24, 18, 15,   24, 18, 15,   24                  21, 20, 19
           18, 15, 14, 13,   14, 13, 12, 10,   12, 10, 9,    12, 10, 9,
           12, 10, 9, 7, 5,  9, 5, 4, 2        5, 2          5, 2
           4, 2
3          21, 16, 11, 8     23, 7             23, 7         23, 7         23                  18, 17, 16, 15
4          17, 6, 3          21, 16, 11, 8     21, 16, 11, 8 21, 16, 11, 8 21, 16, 11, 8       14, 13, 12
5          1                 19                20            20            20                  11, 10
6          -                 17, 3             19, 6         19            19                  9, 8, 7, 6
7          -                 6                 17, 3         17, 3         18, 15, 12, 10, 9   5
8          -                 1                 14, 13, 4     14, 13, 4     17, 6, 3            4, 3, 2
9          -                 -                 1             6             14, 13, 4           1
10         -                 -                 -             1             5                   -
11         -                 -                 -             -             2                   -
12         -                 -                 -             -             1                   -

index
for food   91.52             91.94             91.26         91.25         91.40               95.93


In the determination of the groups visually from the graph, use was made of the
following criteria:
(i) Equality of the weights, or
(ii) Equality of the price relatives, or
(iii) Equality of both weights and price relatives, or
(iv) $\rho_i = 0$.


The existence of criterion (iv) will be evident from the regression line being parallel
to the axis of weights. Criteria (i), (ii) and (iii) are only special cases for $\rho_i = 0$.

The first set (vide col. (2) of Table 2) has been marked out in the diagram indicating
what commodities have been pooled together. The index is 91.52, while the index
from the 25 commodities (groups) is 91.43. The agreement here is quite close. The
agreement is also sufficiently close in the other sets except in the set shown in col. (7)
of Table 2, which has been purposely suggested because such kind of grouping is usually
adopted in practice. That is, items of similar nature are grouped together. The underlying
presumption for this kind of grouping is perhaps the equality of the price relatives giving
zero-value for the correlation coefficient. But such kind of grouping, as will be evident,
may lead to erroneous results. The calculated index is 95.93, and this is wide away from
91.43, and is nearly equal to the index which could be obtained as a simple arithmetic
average of the price relatives of the 25 commodities. Adoption of a course, such as this,
is, therefore, tantamount to ignoring the weights, even though the weights were relevant.
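The index for the grouping by similarity of items can be reproduced from Table 1. The sketch below assumes that each group's pooled weight is applied to the unweighted mean of its price relatives, and that the groups in col. (7) of Table 2 are as listed in `SET_VI`; both the item figures and the group membership are read from this copy of the tables, and the function name is illustrative.

```python
from statistics import fmean

# item number -> (price relative, weight in per cent), as transcribed from Table 1.
ITEMS = {
    1: (102, 27.64), 2: (86, 1.99), 3: (125, 8.28), 4: (95, 0.72),
    5: (91, 5.09), 6: (72, 7.93), 7: (102, 0.93), 8: (92, 0.41),
    9: (85, 3.93), 10: (93, 4.67), 11: (123, 0.56), 12: (101, 3.44),
    13: (95, 0.71), 14: (96, 0.63), 15: (62, 4.13), 16: (155, 0.69),
    17: (57, 8.31), 18: (83, 3.47), 19: (76, 7.54), 20: (97, 1.74),
    21: (80, 0.39), 22: (107, 1.17), 23: (101, 1.34), 24: (90, 3.20),
    25: (116, 1.09),
}

# Groups of similar items, as read from col. (7) of Table 2.
SET_VI = [
    [25, 24, 23, 22], [21, 20, 19], [18, 17, 16, 15], [14, 13, 12],
    [11, 10], [9, 8, 7, 6], [5], [4, 3, 2], [1],
]


def grouped_index(items, groups):
    """Index built from pooled group weights and unweighted mean price relatives."""
    total = 0.0
    for group in groups:
        mean_relative = fmean([items[i][0] for i in group])
        pooled_weight = sum(items[i][1] for i in group)
        total += mean_relative * pooled_weight
    return total / sum(w for _, w in items.values())


index_vi = grouped_index(ITEMS, SET_VI)
print(round(index_vi, 2))
```

The result agrees with the 95.93 quoted in the text, close to the simple average of 95.28 and well away from the weighted index.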


For the sake of a further illustration of the principle, a reference may be made to 
Table 3, where the indexes for the five conventionally accepted major groups of consumption 
have been shown along with the overall CLI. Let the weights of the major groups (2), 
(3) and (4) be pooled together, and a simple arithmetic mean taken of the three indexes (price 
relatives) to correspond to the pooled weight of these three major groups. These three major 
groups taken together will, therefore, form one major group now. In all, then, there will 
be three major groups instead of five. The weighted average of these three indexes (price 
relatives) is 95.6 which differs from the overall CLI by only 0.1. 


TABLE 3. CLI FOR A MONTH IN RESPECT OF A TOWN FOR A SPECIFIC EXPENDITURE LEVEL

                          for the specific expenditure level
major groups              weight       index       weight x index
of consumption
(1)                       (2)          (3)         (4)

1. food                    58.55        91.4       5351.5
2. clothing                 5.37       106.5        571.9
3. fuel & light             6.15       102.2        628.5
4. housing                  9.61       100.0        961.0
5. miscellaneous           20.32       100.3       2038.1
all combined              100.00        95.5
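The pooling of major groups (2), (3) and (4) described for Table 3 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the fuel & light weight is taken as 6.15, the value for which the weights add up to 100.00, and the helper name is an assumption.

```python
# (major group, weight, index) from Table 3. The fuel & light weight of 6.15
# is an assumption of this sketch: it is the value that makes the weights
# add up to 100.00.
GROUPS = [
    ("food", 58.55, 91.4),
    ("clothing", 5.37, 106.5),
    ("fuel & light", 6.15, 102.2),
    ("housing", 9.61, 100.0),
    ("miscellaneous", 20.32, 100.3),
]


def weighted_average(rows):
    """Weighted average of the indexes, using the group weights."""
    return sum(w * x for _, w, x in rows) / sum(w for _, w, _ in rows)


# Overall CLI from all five major groups.
overall_cli = weighted_average(GROUPS)

# Pool groups (2), (3) and (4): add their weights and take the simple
# arithmetic mean of their three indexes.
middle = GROUPS[1:4]
pooled_row = (
    "clothing, fuel & light, housing",
    sum(w for _, w, _ in middle),
    sum(x for _, _, x in middle) / len(middle),
)
pooled_cli = weighted_average([GROUPS[0], pooled_row, GROUPS[4]])

print(round(overall_cli, 1), round(pooled_cli, 1))
```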


8. REMARKS 


The illustration cited in col.(7) of Table 2 demonstrates how non-judicious grouping, 
a kind of which is usually adopted in practice, might lead to serious errors. The 
other illustrations would point to the probable good use which could be made of the basic 
principles in bringing precision in the construction of index numbers. 
PRICE INDEXES AND SAMPLING


By ERLAND v. HOFSTEN 


Statistical Section, National Social Welfare Board, Stockholm
and
Indian Statistical Institute, Calcutta


SUMMARY: Some practical difficulties of obtaining sampling errors for price index num-
bers have been discussed in this note. 


In recent years it has become more and more widely accepted that statistical estimates
should be accompanied by error estimates, and that information about the statistical
error can only be obtained if the estimate is based on a probability sample. There is,
however, one exception from this general rule, namely the field of index numbers, where
very little is known and stated about the precision of the computed figures. However,
two authors have recently given solutions of standard errors for price index numbers. In
the view of the present author a nearer scrutiny of this aspect of the problem demonstrates
that there is a certain inescapable controversy and inconsistency as regards price index
numbers.


The two authors are Banerjee (1956) and Adelman (1958). Banerjee points out that 
prices are normally collected only for a few of the items to be covered by the index. He 
then gives the unbiased estimate of the index as well as the variance. Banerjee’s solution
assumes that only two points of time 0 and 1 are compared and that Laspeyres’ formula is 
used. This implies that prices are assumed available at period 1 for all articles which were 
available at period 0 and that new items are ignored. 

Adelman completely overlooks Banerjee’s paper, although it was published in a
widely circulated journal two years earlier. She does not specify the index formula used;
her index concept is one of price relatives between two periods not too far apart, and she
states that the weights applied must not be out of date. For comparisons over longer
intervals she arrives at the chain index solution.


To start with let us consider the problem of comparing two periods only. During
the first period we have, with usual notations, certain quantities, $q_0$, and prices, $p_0$, of all
articles on the market.

In order to take a probability sample we must define the universe and construct
a frame, from which to draw the sample. If we consider the period 0, the universe
may consist of all the purchases which have taken place during the period (or those made
by some properly delimited population category, such as working class families, etc.). The
sampling frame should consist of all these purchases; in this connection we ignore the
difficulties to obtain such a frame.

It is important to note that not only the quantities but also the prices refer to a period
and that the prices are "paid prices" not "demanded prices". This distinction is of
importance, not so much because of bargaining, discounts, sales, etc., but because of the
definition of the universe.
It is not quite clear which definition Irma Adelman employs. Her statement
"since loss leaders and similar sub-normal price situations often exist on Thursday, Friday
and Saturday, all the pricing was done during the early part of the week" (p.245) seems to
imply that she uses the demanded price as definition.


For the index computation we cannot be satisfied with having a sample referring 
to one period only; the index implies a comparison between at least two periods. If we 
accept the Laspeyres' solution, this implies that we base the sample of items on the condi- 
tions prevailing during period 0 (= quantities purchased and prices actually paid). For 
period 1 we will want to ascertain the amount of money required in order to buy the same 
quantities in the new price situation. This implies a hypothetical question. Thus if an item 
is available but not at all purchased in situation 1, it will nevertheless enter into the index 
computation. Prices for situation 1 will not be paid prices but demanded prices, and it will
not be possible for situation 1 to construct any sampling frame, which corresponds to the one
in situation 0. This is not very satisfactory, but is a necessary consequence of the
Laspeyres’ approach.


The Paasche index, implying the reverse of the Laspeyres' index, of course does not 
solve the problem. 
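The asymmetry between the two formulas can be seen in a small numerical sketch. The prices and quantities below are hypothetical and the function names are illustrative; the Laspeyres index prices the period-0 basket in both situations, while the Paasche index prices the period-1 basket in both.

```python
def laspeyres(p0, q0, p1):
    """Cost of the period-0 basket at period-1 prices, relative to its period-0 cost."""
    return 100.0 * sum(p * q for p, q in zip(p1, q0)) / sum(p * q for p, q in zip(p0, q0))


def paasche(p0, p1, q1):
    """Cost of the period-1 basket at period-1 prices, relative to its cost at period-0 prices."""
    return 100.0 * sum(p * q for p, q in zip(p1, q1)) / sum(p * q for p, q in zip(p0, q1))


# Hypothetical prices and quantities for three items in periods 0 and 1.
p0 = [2.0, 5.0, 10.0]
q0 = [10.0, 4.0, 1.0]
p1 = [3.0, 5.0, 8.0]
q1 = [8.0, 4.0, 2.0]

laspeyres_index = laspeyres(p0, q0, p1)
paasche_index = paasche(p0, p1, q1)
print(laspeyres_index, round(paasche_index, 2))
```

In either case the prices of one period are combined with quantities that were purchased in the other, which is the hypothetical element discussed in the text.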


The indifference defined index also tries to give an answer to a hypothetical question,
i.e. what is the amount of money required to attain an unchanged indifference level (=
generally less amount of money than required for the Laspeyres’ solution)? Information
about the actual position of the consumer in situation 1 does not solve the problem if
the index is based on the indifference level in situation 0.


It remains to be seen whether a universe can be defined where "paid prices" can
be used for both periods 0 and 1. But then only articles actually bought during both
periods 0 and 1 will be included, because no price relatives can be formed for items purchased
only during one of the periods. And what about the weights, shall they be an average for the
two periods and then why? This solution will be rather vague and unsatisfactory.


And finally, is it possible to envisage "demanded prices" as the price definition right
through? Clearly not, in any case it seems difficult to find any universe then, and where
do the quantities come in?


The difference between demanded prices and actually paid prices is also very clear,
if we consider the problem raised in Stone (1956), when the price per unit is a function of
the quantity purchased. This is often the case for electricity and telephone charges, where
a basic payment is made, but also occurs regarding other items of expenditure where bulk
purchases may lead to a lower price per unit. If the Laspeyres’ solution is employed the
index will not show any change, as not more money is required in order to keep the
consumption pattern unaltered. But if paid prices are used, account must be taken of the
price change which has been a consequence of the altered consumption.


The problems discussed above refer to the fact that the universe is changing. Such
changes are most marked as regards clothing and so-called miscellaneous items, whereas
they are less marked for food items. Incidentally most authors on index numbers overlook
these problems, because they choose examples among food items. However, looking
upon the budget as a whole, the problems of the changing universe are severe; for evidence
see Hofsten (1952).
If the periods compared are near each other, it is tempting to state that then the
changes of the universe must be so small that they can be overlooked. In order to make 
possible comparisons over long intervals, we must then resort to the chain index solution. 


The chain index implies an integral solution of the index problem [cf. Divisia (1925).] 
If this solution is chosen, then it is necessary that the infinitesimal expression for the index, 
i.e. the index for each separate link, is correct. If this is not the case, the chain index only
implies a comfortable technique, by which the intrinsic problems of comparing two distant 
periods are avoided. Incidentally the chain index solution is not available for geographical 
comparisons, at least not between different countries. 


Adelman states about the chain index that “the ease with which new products or 
qualities can be incorporated into our scheme (and obsolete items eliminated) provides a 
significant advantage over the current system” (p.243). This advantage, in my mind implies 
a great danger, because it violates the principle that the infinitesimal expression for the 
index must be correct. 


There is one additional problem of a partly practical character. A computation of 
a standard error for an index will in the first hand refer to a comparison between two periods 
only [as in Banerjee (1956) and Adelman (1958)]. But in actual practice indexes are most
often given in the form of long regular series. If the series is computed as a chain index,
what standard error formula shall then be used? And as the consumer will desire to
compare any single index figure with any other figure, what standard errors shall be given?


My conclusion from the above arguments is that there is no such thing as a statistical
precision for a price index. Attempts to define the index in a statistical way, applying 
modern theory of sampling, only demonstrate that there is no satisfactory solution available. 
We may, therefore, just as well keep to the old practice and define the price index in an opera- 
tional way and abstain from giving standard errors. This, of course, does not exclude the 
usefulness of applying the chain index solution or of basing the selection of items on pro- 
bability sampling and making analyses of the precision of price measurements. But when 
applying the chain index solution we must not allow the substitution of some items against 
others without making quality adjustments; see Hofsten (1952) and Stone (1956). 
CORRIGENDA


Bias in Estimation of Serial Correlation Coefficients: By A. Sree Rama
Sastry, Sankhyā, 11, 281-296.


Formula (11) on page 283 should be read 


T-k-1 


TO EE AÛ 
25 (01-9 (ычы) e ег 01) 
RT AA. B 
(T—k—1)(T—E940)-2 У (пдд 


del 


The author wishes to thank Mr. E. G. Phadia for pointing out the printing error.


Expressions for the Lower Bound to Confidence Coefficients: By Saibal
Kumar Banerjee, Sankhyā, 21, 127-140.


» ase table 
І. — occurring in (i) expression (2.4.3), (ii) first sentence of para 2.5 and (iii) 
n 


heading of Table 1, all at page 129, should be read as NE 


2. By occurring in para 3.7, page 132, is 


k 
X XB, 
n 1 